Data Cleaning for Data Scientists
Many of us have heard the urban myth that if you are a data analyst or data scientist, data cleaning (also known as data munging) makes up 80% of the job, with the other 20% being machine learning and analysis. To be honest, I think that ratio is understated; it should be more like 90% data cleaning and 10% machine learning and analysis.
With that in mind, you will notice an anomaly: most courses do not teach data cleaning, or only touch on it briefly. Granted, machine learning is the “sexy” part of being a data scientist, because that is the part where we feel like magicians, turning data into actionable insights.
This blog post is my attempt to give the “greens” a structured way of picking up data cleaning. I am assuming here that the “greens” know how to do a “good” exploratory data analysis (covered in two posts, visual & non-visual). With data cleaning skills, “greens” can start working on projects from any kind of dataset.
Languages
If you want to pick up data cleaning, you have to learn the tasks that I list later in the following languages, so that you can continue working on your data science projects or portfolio. (I have a blog post on preparing your data science resume and portfolio here.)
- SQL: Also known as Structured Query Language. A lot of companies store their data in relational databases, or in NoSQL databases that offer a SQL-like query interface, so extracting the data can be done through SQL. Being well versed in SQL will at least show potential employers that, “Hey, at least this interviewee can complete the first step of extracting data from my databases.” For free courses on SQL, here are a few resources: W3Schools or Khan Academy.
- R: One of the most common open-source tools out there, and it has a huge community. To make it easier to learn data cleaning, you can explore the Tidyverse; the package I use most often in it is dplyr (but there are others, such as tidyr & readr). The Tidyverse was created by Hadley Wickham, who has contributed a lot of popular R packages. His profile is here.
- Python: Another common open-source tool. Python has leapfrogged R as the go-to tool for data science/artificial intelligence. The main package to learn is pandas (as with R, there are other packages as well for doing a “good” cleaning).
- SAS: SAS is a commonly used enterprise tool for data science. It is used by bigger organizations, and although it is an enterprise tool, “greens” can actually download SAS University Edition to learn more about SAS programming. Note that it is for non-commercial usage.
Common Data Cleaning Tasks
Here is a list of data cleaning tasks that you can try to achieve with the tools stated above. The list is focused on structured data. If you are able to accomplish the tasks below with the tools mentioned above, congratulations: you are on your way to using machine learning on your data and en route to becoming a valued data scientist.
- Import & export of datasets
- Naming or renaming variables
- Changing the type of variables (also known as explicit coercion)
- Sorting on one or more variables, and removing records with duplicate keys or entirely duplicate records
- Selecting columns from input dataset to output dataset
- Filtering of rows based on one or more conditions
- Creating new variables through functions of existing variables
- Conditional processing of variables (i.e., the value of a new variable is based on the values of existing variables)
- Appending tables
- Joining tables (Inner Join, Left and Right Join, Full Outer Join)
- Transposing tables
- Summarizing a column, overall or by groups
- Normalizing and standardizing columns (for continuous variables)
- Binning of continuous variables
- Imputing missing values in a variable
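To make the row- and column-level tasks a bit more concrete, here is a minimal sketch in Python with pandas. It is only one way to do it, not the definitive way. The data, the column names (cust_id, age, spend) and the variable names are all made up for illustration, and the "file" is an in-memory string so the example runs on its own.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data standing in for an imported CSV file.
raw_csv = io.StringIO(
    "cust id,age,spend\n"
    "3,41,120.5\n"
    "1,25,80.0\n"
    "2,33,\n"
    "1,25,80.0\n"
)

# Import: read everything as strings first so the type change is explicit.
df = pd.read_csv(raw_csv, dtype=str)

# Rename a variable.
df = df.rename(columns={"cust id": "cust_id"})

# Explicit coercion: change the type of each variable.
df["cust_id"] = df["cust_id"].astype(int)
df["age"] = df["age"].astype(int)
df["spend"] = pd.to_numeric(df["spend"])  # the empty field stays missing (NaN)

# Drop entirely duplicate records, then sort on one variable.
df = df.drop_duplicates().sort_values("cust_id").reset_index(drop=True)

# Filter rows on a condition and select a subset of columns.
adults_30_plus = df.loc[df["age"] >= 30, ["cust_id", "age"]]

# Create a new variable as a function of existing variables.
df["spend_per_year_of_age"] = df["spend"] / df["age"]

# Conditional processing: a new variable based on the values of an existing one.
df["age_group"] = np.where(df["age"] >= 30, "30+", "under 30")

# Export (here to a string; pass a file path to write a file instead).
csv_out = df.to_csv(index=False)
```

To deduplicate on a key rather than on the whole record, drop_duplicates also accepts a subset argument, for example subset=["cust_id"].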
The list above covers the data cleaning tasks a data analyst or data scientist needs to be familiar with. It might not be comprehensive, but it's a good start.
I hope the list can help you prepare your data better and be on your way to building up an impressive data science portfolio.
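For the table-level tasks in the list (appending, joining, transposing, summarizing by groups, normalizing and standardizing, binning, imputing), a rough pandas sketch could look like the one below. Again, the tables, column names and bin edges are hypothetical, and choices such as imputing with the median are just one option among many.

```python
import pandas as pd

# Hypothetical toy tables; all names and values are made up for illustration.
orders_q1 = pd.DataFrame({"cust_id": [1, 2], "amount": [100.0, 50.0]})
orders_q2 = pd.DataFrame({"cust_id": [1, 3], "amount": [None, 70.0]})
customers = pd.DataFrame({"cust_id": [1, 2, 4], "region": ["N", "S", "N"]})

# Appending tables: stack the rows of tables that share the same columns.
orders = pd.concat([orders_q1, orders_q2], ignore_index=True)

# Imputing missing values: fill with the column median (one of many choices).
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Joining tables: a left join keeps every order, matched or not.
joined = orders.merge(customers, on="cust_id", how="left")

# Summarizing a column by groups.
by_region = joined.groupby("region")["amount"].agg(["count", "mean"])

# Transposing a table: rows become columns and vice versa.
by_region_t = by_region.T

# Standardizing (z-score) and normalizing (min-max to [0, 1]) a column.
amt = joined["amount"]
joined["amount_z"] = (amt - amt.mean()) / amt.std()
joined["amount_01"] = (amt - amt.min()) / (amt.max() - amt.min())

# Binning a continuous variable into labelled ranges.
joined["amount_band"] = pd.cut(
    amt, bins=[0, 60, 80, float("inf")], labels=["low", "mid", "high"]
)
```

Changing how="left" to "inner", "right" or "outer" in merge gives the other join types from the list.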
I wish all readers a FUN Data Science and Artificial Intelligence learning journey. Do check out my other blog posts. Keep in touch on LinkedIn or Twitter, or subscribe to my newsletter to find out what I am thinking, doing or learning. :)