In my last post, I talked about the six levels of preparation for becoming a data scientist. To recap, the six levels are:
Level 0 - Calculus, Linear Algebra, Statistics
Level 1 - Coding (SQL at least), Data Analysis, Data Visualization
Level 2 - Coding (include R & Python), Data Munging
Level 3 - Machine Learning (Train, Validate, Selection)
Level 4 - Lots of Hands-On
Level 5 - Business Implications of Machine Learning
I covered Level 0 and Level 1 in greater detail there. In this post, I will share more details on Level 2 and Level 3.
Level 2 - Coding (include R & Python), Data Munging
After Level 1, you should have some idea of what new features to create and how to clean up your data, for instance by dealing with outliers and missing values. At this point, SQL alone is no longer sufficient and you will likely need R or Python instead. R and Python provide more sophisticated tools for dealing with the complexity of data.
For R, you will mainly be using packages from the Tidyverse, whereas for Python you will be working with pandas and NumPy.
You will need to familiarise yourself with the common data cleaning tasks, which I have written about in another post. Learn to use R or Python for that list of tasks and you will be well on your way to making sure the data is clean enough for training machine learning models.
Unfortunately (and fortunately), there is no step-by-step way of cleaning data, as the data challenges are fairly unique to each project. So as long as you are familiar with using R or Python to perform the data cleaning tasks, you should be good to go. :)
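To make the two cleaning tasks mentioned above a little more concrete, here is a minimal pandas sketch. The DataFrame, its column names and its values are made up for illustration, and median imputation and the 1.5 x IQR capping rule are just common choices, not the only (or always the right) ones.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age and one implausibly large income.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 38.0],
    "income": [3200.0, 4100.0, 3900.0, 250000.0, 4500.0],
})

# Missing values: fill the gap with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Outliers: cap anything further than 1.5 * IQR from the quartiles.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["income_capped"] = df["income"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(df)
```

Whether to cap, drop or keep an outlier depends on the project, which is exactly why there is no fixed recipe here.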
Level 3 - Machine Learning (Train, Validate, Selection)
You are now equipped with the maths and statistics needed to understand the different machine learning models. Now is the time to apply them, so to speak. Try to understand the mathematics behind each machine learning model you come across. Studying it will shed light on the strengths and weaknesses of each model. You will also start to learn the possible business use cases for each model, the data structures that suit each of them, and much more.
After studying the machine learning models, you also need a clear understanding of how to choose the "right" model for the use case. Here is a tip: we do not always go for the most accurate model. Want to know more? I have written two posts on it. Check them out here and here. You need a good understanding of what each model performance metric measures and which one to use when selecting the best-performing model. Should we choose Precision, Recall, MAE, MSE, R-squared, etc.?
Also, you will need a good understanding of the train-validate-test split and how it affects the model training process, i.e. we are being Goldilocks here, trying to get a "just-right" model that neither overfits nor underfits too much. Implementing the train-validate-test split is easy; what matters more is the decision made here and how it impacts the model training phase.
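As a rough illustration of the metrics named above, here is a small sketch using scikit-learn's metric functions. The labels and predictions are invented purely for the example. Precision and Recall apply to classification, while MAE, MSE and R-squared apply to regression.

```python
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)

# Hypothetical classification labels (1 = positive class) and predictions.
y_true_cls = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred_cls = [1, 0, 0, 1, 0, 0, 1, 0]

# Precision: of the predicted positives, how many were truly positive?
precision = precision_score(y_true_cls, y_pred_cls)
# Recall: of the true positives, how many did the model find?
recall = recall_score(y_true_cls, y_pred_cls)

# Hypothetical regression targets and predictions.
y_true_reg = [3.0, 5.0, 2.0, 7.0]
y_pred_reg = [2.5, 5.0, 3.0, 8.0]

mae = mean_absolute_error(y_true_reg, y_pred_reg)  # average absolute error
mse = mean_squared_error(y_true_reg, y_pred_reg)   # penalises large errors more
r2 = r2_score(y_true_reg, y_pred_reg)              # share of variance explained

print(precision, recall, mae, mse, r2)
```

Which metric to optimise depends on the business cost of each type of error, for example whether a missed positive is worse than a false alarm.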
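Here is one way the split could look in Python with scikit-learn, as a sketch rather than a recipe. The synthetic dataset, the 60/20/20 proportions, the decision tree and the candidate depths are all assumptions made for illustration. The point is the role of each set: fit on train, choose between candidates on validation, and touch the test set only once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real, cleaned dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# First hold out 20% as the test set, then split the rest 75/25,
# which gives 60% train, 20% validation, 20% test overall.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0, stratify=y_temp
)

# Shallow trees tend to underfit and unlimited depth tends to overfit,
# so compare a few depths on the validation set only.
candidate_depths = [1, 3, 5, None]
val_scores = {}
for depth in candidate_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    val_scores[depth] = model.score(X_val, y_val)

best_depth = max(val_scores, key=val_scores.get)

# The test set is used once, for the final estimate of the chosen model.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)

print(val_scores, best_depth, test_accuracy)
```

If the test set were also used to pick the depth, its score would no longer be an honest estimate of how the model behaves on unseen data.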
These are the details I have for now on Level 2 and Level 3. I will touch on the last two levels in the next post. :)
I wish you all the best, and do look out for my next post on the last two levels.
Do check out my other blog posts. Keep in touch on LinkedIn or Twitter, or subscribe to my newsletter to find out what I am thinking, doing or learning. :)