Importance of Documentation in Data Science
If you want to be a data professional, you cannot run away from documentation, no matter how much you dread it. It is a necessary "evil", so to speak. So how do we tackle it?
Why
Let us start with "why" we need documentation.
During any data science project, we make many decisions and run many experiments: for instance, how to fill in missing values when needed, and which values to choose for the many hyper-parameters we need to tune. Documentation helps us keep track of the decisions we made and, when we revisit them later, whether they worked out.
Many areas need this kind of record-keeping, starting with data exploration and munging: how we handled outliers, missing values, and so on.
When you refer back to the decisions you recorded, chances are you will learn something that improves your future projects. Documenting the thought process, such as why a certain value was chosen for a hyper-parameter, will help your learning later.
What
Let us now tackle "what" to document. The list below is what I could think of at the time of writing, but you can use it to build towards your 'ideal' documentation.
Data Management & Exploratory Data Analysis
- What is the quality of data?
- How many unique values does each categorical variable have? What are the distribution and summary statistics of each continuous variable?
- What is the primary and/or secondary key?
- How is the training data derived?
- How are outliers and missing data handled?
- What are the data munging steps taken and why?
- What are the relationships between the different features and the target?
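To make a few of the questions above concrete, here is a minimal sketch of how the answers could be captured automatically, so they can be pasted into your project notes. It is only an illustration, not a prescribed method: the function name `profile_column`, the toy rows, and the column names are all made up, and it uses only the Python standard library. In a real project you would probably lean on your usual data library instead.

```python
from collections import Counter
from statistics import mean, median


def profile_column(rows, column):
    """Summarise one column of a list-of-dicts dataset for the project notes."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    summary = {
        "column": column,
        "n_rows": len(values),
        "n_missing": len(values) - len(present),
    }
    # Treat a column as continuous only if every known value is a number.
    # bool is a subclass of int in Python, so it is excluded explicitly.
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        summary.update(
            kind="continuous",
            min=min(present),
            max=max(present),
            mean=mean(present),
            median=median(present),
        )
    else:
        counts = Counter(present)
        summary.update(
            kind="categorical",
            n_unique=len(counts),
            top_values=counts.most_common(3),
        )
    return summary


# Hypothetical toy data, invented purely for illustration.
rows = [
    {"customer_id": 1, "plan": "basic", "monthly_spend": 20.0},
    {"customer_id": 2, "plan": "pro", "monthly_spend": 55.0},
    {"customer_id": 3, "plan": "basic", "monthly_spend": None},
    {"customer_id": 4, "plan": None, "monthly_spend": 35.0},
]

plan_profile = profile_column(rows, "plan")
spend_profile = profile_column(rows, "monthly_spend")
print(plan_profile)
print(spend_profile)
```

The point is less the code itself and more the habit: the numbers that answer "how many missing values?" or "how many unique categories?" are generated from the data and stored alongside your notes, rather than typed in from memory.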
Machine Learning
- What is the business question that needs to be answered? What is the associated Machine Learning or Analysis question?
- What is the definition of Target? What is the Value Function?
- What are the models used?
- What are the model performance metrics selected and why?
- What features were engineered and how did they perform?
- What feature selection method was chosen and why?
- What hyper-parameters were chosen and tested, and how did they perform?
- How did each model perform based on the chosen performance metrics?
- How was the final model selected?
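One lightweight way to answer several of the questions above is to keep a small experiment log. The sketch below is just one possible shape for it, under my own assumptions: the `ExperimentRecord` fields, the helper names, the file name `experiments.jsonl`, and the metric values are all hypothetical placeholders, not results from a real project. It appends one JSON line per experiment, including the reasoning behind it.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ExperimentRecord:
    """One row of the experiment log: what was tried, how it did, and why."""
    model: str
    hyper_parameters: dict
    metrics: dict
    rationale: str
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def append_record(path, record):
    """Append a record to a JSON Lines file (one JSON object per line)."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(record)) + "\n")


def load_records(path):
    """Read every record back as a plain dictionary."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def best_record(records, metric):
    """Return the record with the highest value for the given metric."""
    return max(records, key=lambda r: r["metrics"][metric])


# A temporary directory keeps this example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = os.path.join(tmp_dir, "experiments.jsonl")
    # The metric values below are invented placeholders.
    append_record(log_path, ExperimentRecord(
        model="logistic_regression",
        hyper_parameters={"C": 1.0},
        metrics={"auc": 0.71},
        rationale="Simple baseline to compare everything else against.",
    ))
    append_record(log_path, ExperimentRecord(
        model="gradient_boosting",
        hyper_parameters={"n_estimators": 200, "max_depth": 3},
        metrics={"auc": 0.78},
        rationale="Shallow trees to limit overfitting on a small dataset.",
    ))
    records = load_records(log_path)

best = best_record(records, "auc")
print(best["model"], "-", best["rationale"])
```

The `rationale` field is the part I would not skip: the hyper-parameters and metrics tell you what happened, but the rationale tells your future self why you tried it.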
I hope this list gives you a good start on forming your own. :)
How
Let us now tackle "how" to do documentation.
Unfortunately, I have not come across a tool that is comprehensive enough to cover the full machine learning lifecycle. If you have, please let me know. Otherwise, you can use individual tools to keep documentation and an audit trail.
What I have seen in some companies is that they use a wiki and Git to document. Microsoft Word or a similar online collaboration tool is another possibility, but I do not think it will find favor with engineers. Commenting your code as much as possible is a great way to document, and it is convenient. The downside is that you may clutter up your code, making debugging quite challenging.
What tools do you commonly use for documentation in your workplace? You can share them with me on LinkedIn.
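One possible way to get the convenience of in-code documentation without the clutter is to put the decision in a docstring rather than in scattered inline comments. Here is a small sketch of that idea. The function `fill_missing_with_median` and the reasoning written in its docstring are purely illustrative assumptions, not a recommendation for your data.

```python
from statistics import median


def fill_missing_with_median(values):
    """Replace missing entries (None) with the median of the known values.

    Decision: use the median rather than the mean.
    Why (illustrative reasoning): the median is less sensitive to a few
    extreme values than the mean.
    Alternatives considered: dropping the rows, filling with the mean.
    Revisit if: the share of missing values grows large.
    """
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot compute a median when every value is missing")
    fill_value = median(known)
    return [fill_value if v is None else v for v in values]


filled = fill_missing_with_median([1.0, None, 3.0])
print(filled)
# The documented decision travels with the function and can be read back.
print(fill_missing_with_median.__doc__)
```

The body of the function stays short and easy to debug, while the "why" sits in one predictable place that `help()` and most editors will show.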
Conclusion
Documentation is very important. Yes, most engineers I come across do not like to do it, but for a machine learning project to be reproducible, which is critical for learning and improving, it is vital that documentation is done, and in as much detail as possible.
Good luck and all the best! :)
My Symbolic Connection podcast co-host, Thu Ya Kyaw, and I also did an episode on documentation. Do check it out!
Do check out my other blog posts. Keep in touch on LinkedIn or Twitter, or subscribe to my newsletter to find out what I am thinking, doing or learning. :)