I get asked many times, “How can I do a good Exploratory Data Analysis (EDA) so that I can get the information needed for feature engineering and building a machine learning model?”

In this post and the next, I hope to answer that question. I will NOT claim my process is the best, but I hope that as more people come into the field, they can use my process as a basis for better EDA and build better models.

There are two main benefits to doing EDA, and they pay off throughout the model-building process:

  1. Have a good understanding of data quality. We need high-quality data to build good models. I tell most of my students and trainees that data is never clean; we only get it to a quality level that we can use.
  2. Gain some quick insights into the project. Understand what the potential drivers for supervised learning are, or what patterns might exist. These insights can be quick wins to get more buy-in from other stakeholders.

I will discuss EDA in two posts: non-visual (mainly through simple calculations) and visual. Let us start with the non-visual.

Summary Statistics

Most of us will remember our summary statistics, such as mean, median, mode and range. If you need a quick revision, here is the Wikipedia post. Summary statistics allow us to quickly pinpoint whether there are any issues with the data. Here are two examples.

  1. Checking the data against business rules/industry regulations. I worked in banks before, and when it comes to credit card application data, we have to check the “Age” column to ensure that there are no applications where the age is below the regulatory requirement. The range (the min and max values) helps us see that very quickly.
  2. Comparing the median and the mean gives a quick sense of whether the data is left- or right-skewed. With that in mind, the next question is whether it makes sense for the data to have that kind of distribution. One example I usually quote is income. Income should most of the time be right-skewed (most people are found at the lower income levels, with few people at the high end). If you get a normal or left-skewed distribution instead, there is a need to check the data quality. A short sketch of both checks follows this list.
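
Here is a minimal pandas sketch of both checks, assuming a DataFrame loaded from a hypothetical applications.csv with “Age” and “Income” columns (all names here are placeholders, not from a real dataset).

```python
import pandas as pd

# Hypothetical credit card application data
df = pd.read_csv("applications.csv")

# Range check: min/max of Age against a regulatory floor (e.g. 18)
print(df["Age"].describe()[["min", "max"]])
underage = df[df["Age"] < 18]
print(f"{len(underage)} applications below the age requirement")

# Skewness check: compare the mean and median of Income
mean_income = df["Income"].mean()
median_income = df["Income"].median()
print(f"mean = {mean_income:.0f}, median = {median_income:.0f}")
if mean_income > median_income:
    print("Income looks right-skewed, as expected")
else:
    print("Income is not right-skewed; worth checking data quality")
```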

Missing Values

I find that a lot of people, at the start, have the common misunderstanding that missing data is “bad”. That is not entirely true. One has to bring in the context in order to know whether it is “bad” or not. Here’s an example.

Most homes used to have landlines (i.e. home telephones), which means that if one collected home telephone numbers, a value was expected. Nowadays, with the advent and convenience of smartphones, most people are unlikely to have a home telephone number to give. In that sense, a missing telephone number is not “bad”.

When we are exploring the data, we need to understand how much of the data is missing (i.e. out of the total number of observations, how many have missing values). This allows us to foresee possible issues with the machine learning model, choose a possible imputation method, or even set up an indicator/dummy variable to flag the missing values.
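
A quick sketch of that check in pandas, again using the hypothetical applications.csv and a made-up “HomePhone” column:

```python
import pandas as pd

df = pd.read_csv("applications.csv")  # hypothetical file

# Count and percentage of missing values per column
missing = df.isna().sum().sort_values(ascending=False)
missing_pct = (missing / len(df) * 100).round(1)
print(pd.concat([missing, missing_pct], axis=1, keys=["count", "percent"]))

# Example: add an indicator/dummy column before imputing
df["home_phone_missing"] = df["HomePhone"].isna().astype(int)
df["HomePhone"] = df["HomePhone"].fillna("unknown")
```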

Correlations

Correlations help in understanding which of the current features can be drivers of the target. This, in my opinion, is important: given that we are dealing with so much data and using deep learning models for everything, we need to limit the features that we feed into our models. Fewer features allow faster training of models (assuming deep learning) and also lower maintenance costs.

Another benefit is anticipating the issue of multicollinearity. Having a good understanding of highly correlated pairs of features allows us to double-check those features, to understand whether their parameters make sense and, if they do not, how we can rectify it. Of course, one can still use the Variance Inflation Factor to identify it, but correlations are just an additional ‘security blanket’.
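
A short sketch of both checks, assuming a numeric DataFrame and a hypothetical target column called “defaulted”:

```python
import pandas as pd

df = pd.read_csv("applications.csv")  # hypothetical file
target = "defaulted"                   # hypothetical target column

# Correlation of each numeric feature with the target
corr = df.corr(numeric_only=True)
print(corr[target].drop(target).sort_values(key=abs, ascending=False))

# Highly correlated feature pairs (potential multicollinearity)
abs_corr = corr.abs().stack()
pairs = abs_corr[
    (abs_corr > 0.8)
    & (abs_corr.index.get_level_values(0) < abs_corr.index.get_level_values(1))
]
print(pairs)
```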

Cluster Analysis

Once you have done the correlation analysis, you can select the features that are most highly correlated with the target and perform a cluster analysis (simple K-Means will do) on them. Ideally, use about 7 to 8 features in addition to the target. As for the number of clusters, it is up to the individual to decide and to see whether “easy-to-interpret” clusters (i.e. clusters you can put a definitive label on) come out.

Because cluster analysis takes all of the input features into account when forming the clusters, one can see how these features and the target come together. This provides an additional ‘security blanket’, to ensure that we have built the ‘right’ model. A minimal sketch follows.
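
Here is one way this could look with scikit-learn, continuing the same hypothetical dataset and target; the number of clusters (4 here) is just an illustrative choice.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("applications.csv")  # hypothetical file
target = "defaulted"                   # hypothetical target column

# Pick the 8 features most correlated (in absolute value) with the target
corr = df.corr(numeric_only=True)[target].drop(target)
top_features = corr.abs().sort_values(ascending=False).head(8).index.tolist()

# Scale and cluster; the number of clusters is a judgment call
data = df[top_features + [target]].dropna()
X = StandardScaler().fit_transform(data)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Profile each cluster to see if it can be given an easy-to-interpret label
print(data.assign(cluster=kmeans.labels_).groupby("cluster").mean())
```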

Conclusion

These are the non-visual steps I usually take when I perform Exploratory Data Analysis. One very important note: when doing Exploratory Data Analysis, documentation matters, as these learnings are going to be useful in the downstream process, namely model training.

Again, these are not definitive steps, and you may want to add other steps to supplement your model-building process. If you do have other steps, please share them with me.

In my next post, I will discuss more about how to explore data using visuals.

(Note: This post was originally written for Medium; this is an edited and updated version of it. The original post can be found here.)

Keep in touch on LinkedIn or Twitter! :)

Do consider signing up for my newsletter to be updated with what I am learning and doing. :)