I am often asked, “How can I do a good Exploratory Data Analysis (EDA) so that I can get the necessary information for feature engineering and building a machine learning model?”
In this post and the next, I hope to answer that question. I do NOT claim my process is the best, but I hope that as more people come into the field, they can use it as a basis for better EDA and better models.
There are two main benefits of doing EDA, and both pay off throughout the model building process:
- Have a good understanding of data quality. We need high-quality data to build good models. I tell most of my students and trainees that data is never clean; we only get it to a quality level that we can use.
- Gain some quick insights into the project. Understand what the potential drivers are for supervised learning, or what patterns may exist in the data. These insights can be quick wins that get more buy-in from other stakeholders.
I will discuss EDA in two posts: non-visual (mainly through simple calculations) and visual. Let us start with non-visual.
Most of us will remember our summary statistics, such as mean, median, mode and range. If you need a quick revision, here is the Wikipedia post. Summary statistics allow us to quickly pinpoint whether there are any issues with the data. Here are two examples.
- Checking the data against business rules/industry regulations. I used to work in banking, and for credit card application data we had to check the “Age” column to ensure that there were no applications where the age was below the regulatory minimum. The range (the min and max values) lets us see that very quickly.
- Comparing the median and the mean gives a rough indication of whether the data is left- or right-skewed (a mean well above the median suggests right skew, and a mean well below it suggests left skew). With that in mind, the next question is whether it makes sense for the data to have that kind of distribution. One example I usually quote is income. Income should, most of the time, be right-skewed (most people are found at the low income level, with few people at the high income level). If the data turns out to be normally distributed or left-skewed, there is a need to check the data quality.
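As a rough sketch of these two checks, here is one way to do them with only the Python standard library. The sample values, the `MIN_AGE` threshold and the helper names `summarize` and `skew_hint` are all made up for illustration; in practice you would run the same idea on your own columns, probably with a data frame library.

```python
import statistics

def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

def skew_hint(summary):
    """Rough direction of skew, judged only by comparing mean and median."""
    if summary["mean"] > summary["median"]:
        return "right-skewed"
    if summary["mean"] < summary["median"]:
        return "left-skewed"
    return "roughly symmetric"

# Hypothetical sample data, invented for this example.
ages = [25, 31, 19, 47, 16, 38]
incomes = [2200, 2500, 2800, 3000, 3400, 4100, 15000]

# Assumed regulatory minimum; replace with the rule that applies to you.
MIN_AGE = 21

# Business-rule check: the range shows at a glance if any age is too low.
age_summary = summarize(ages)
underage = [age for age in ages if age < MIN_AGE]
print("Age range:", age_summary["min"], "to", age_summary["max"])
print("Applications below the minimum age:", underage)

# Distribution check: does the skew direction match what we expect for income?
income_summary = summarize(incomes)
income_skew = skew_hint(income_summary)
print("Income mean vs median:", income_summary["mean"], income_summary["median"])
print("Income looks", income_skew)
```

The mean-versus-median comparison is only a quick hint, not a formal skewness test, so treat it as a prompt to look closer rather than a conclusion.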
I find that a lot of people, when starting out, have the common misunderstanding that missing data is “bad”. That is not entirely true. One has to bring in the context to know whether it is “bad” or not. Here’s an example.
Most homes used to have landlines (i.e. a home telephone), which meant that if one collected home telephone numbers, a value was expected. Nowadays, with the advent and convenience of smartphones, most people are unlikely to have a home telephone number to give. In that context, a missing home telephone number is not “bad”.
When we are exploring the data, we need to understand how much of the data is missing (i.e. out of the number of observations, how many have missing values). This information lets us foresee possible issues with the machine learning model, decide on a possible imputation method, or even set up an indicator/dummy variable to flag missing values.
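A minimal sketch of this step, again in plain Python: compute the share of missing values per column, then add an indicator variable. The records, the column names and the helper functions are hypothetical, and treating an empty string as missing is an assumption that depends on how your data was collected.

```python
# Hypothetical records, invented for this example.
records = [
    {"name": "A", "home_phone": "6123 4567", "income": 3200},
    {"name": "B", "home_phone": None, "income": 4100},
    {"name": "C", "home_phone": None, "income": None},
    {"name": "D", "home_phone": "", "income": 2800},
]

def is_missing(value):
    """Treat None and blank strings as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, str) and value.strip() == "")

def missing_share(rows, column):
    """Fraction of observations where the column has no usable value."""
    missing = sum(1 for row in rows if is_missing(row.get(column)))
    return missing / len(rows)

def add_missing_indicator(rows, column):
    """Return copies of the rows with a 0/1 '<column>_missing' flag added."""
    return [
        {**row, f"{column}_missing": int(is_missing(row.get(column)))}
        for row in rows
    ]

shares = {col: missing_share(records, col) for col in ("home_phone", "income")}
for col, share in shares.items():
    print(f"{col}: {share:.0%} missing")

flagged = add_missing_indicator(records, "home_phone")
print([row["home_phone_missing"] for row in flagged])
```

Whether a high share is a problem still depends on the context: here a mostly empty `home_phone` column may be perfectly reasonable, while a mostly empty `income` column probably deserves a closer look.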
Correlations help in understanding which of the current features can be a driver of the target. In my opinion, this matters because we are dealing with so much data and using deep learning models for everything, so we need to limit the features that we feed into our models. Fewer features allow faster training of models (assuming deep learning) and also lower maintenance costs.
Another benefit is anticipating the issue of multi-collinearity. A good understanding of highly correlated pairs of features allows us to double-check those two features, to see whether their parameters make sense and, if they do not, how to rectify it. Of course, one can still use the Variance Inflation Factor to identify it, but correlations are just an additional ‘security blanket’.
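Both uses of correlation can be sketched in a few lines: rank the features by the absolute Pearson correlation with the target, then list the feature pairs whose correlation with each other is above a threshold. The data, the feature names and the 0.9 cut-off are assumptions chosen for illustration, not recommended values.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical features and target, invented for this example.
features = {
    "income": [2, 3, 4, 5, 6, 7],
    "salary_band": [1, 1, 2, 2, 3, 3],
    "queue_number": [5, 1, 4, 2, 6, 3],
}
target = [10, 14, 19, 26, 30, 36]  # e.g. monthly spend

# 1) Which features move with the target? Rank by absolute correlation.
ranked = sorted(
    ((name, pearson(values, target)) for name, values in features.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: r = {r:.2f}")

# 2) Which feature pairs are highly correlated with each other?
PAIR_THRESHOLD = 0.9  # assumed cut-off; tune it for your own project
high_pairs = [
    (a, b)
    for a, b in combinations(features, 2)
    if abs(pearson(features[a], features[b])) >= PAIR_THRESHOLD
]
print("Highly correlated pairs:", high_pairs)
```

Pearson correlation only captures linear relationships, so a feature with a low value here may still matter in a non-linear model.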
Once you have done the correlation analysis, you can select the features that are highly correlated with the target and perform a cluster analysis (simple K-Means will do) on them. Ideally, use about 7 to 8 features in addition to the target. As for the number of clusters, it is up to the individual to decide, and to see whether “easy-to-interpret” clusters (i.e. ones where a definitive label can be put on most clusters) come out.
Because cluster analysis takes all input features into account when forming the clusters, one can see how these features and the target come together. This provides an additional ‘security blanket’ to ensure that we have built the ‘right’ model.
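To make the idea concrete, here is a small, self-contained K-Means sketch in plain Python. It is a teaching version under stated assumptions, not a production implementation: the six rows of (income, tenure, spend) are invented, only three columns are used to keep it short, `k=2` is an arbitrary choice, and the columns are standardized first because K-Means is sensitive to scale. For real work you would more likely reach for an existing library implementation.

```python
import random
import statistics

def standardize(points):
    """Z-score each column so no single feature dominates the distances."""
    columns = list(zip(*points))
    means = [statistics.mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(point, means, stdevs))
        for point in points
    ]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Basic K-Means: returns a cluster label per point and the centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(statistics.mean(dim) for dim in zip(*members))
    return labels, centroids

# Hypothetical rows of (income, tenure, spend); spend plays the target role.
rows = [
    (2.0, 1.0, 10.0), (3.0, 2.0, 14.0), (2.5, 1.0, 12.0),
    (8.0, 9.0, 40.0), (9.0, 10.0, 45.0), (8.5, 9.0, 42.0),
]
labels, centroids = kmeans(standardize(rows), k=2)

# Profile each cluster on the original scale to see if it is easy to label.
for c in sorted(set(labels)):
    members = [row for row, lab in zip(rows, labels) if lab == c]
    profile = [round(statistics.mean(col), 1) for col in zip(*members)]
    print(f"cluster {c}: n={len(members)}, mean (income, tenure, spend) = {profile}")
```

The per-cluster averages are what you would try to put a label on, for example a “low income, short tenure, low spend” group versus a “high income, long tenure, high spend” group.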
These are the non-visual steps I usually take when I perform Exploratory Data Analysis. One very important note: document what you find during Exploratory Data Analysis, as these learnings are going to be useful in the downstream process, namely the model training process.
Again, these are not definitive steps, and you may want to add other steps to supplement your model building process. If you do have other steps, please share them with me.
In my next post, I will discuss how to explore data using visuals.
(Note: This post was written previously for Medium and this is an edited version of it. Updated as well. Original post can be found here.)