Becoming a Data Scientist - Part 1

A question I get at almost every meetup, seminar or training session is, “Given my XXX background (the common ones being Computer Science, Statistics, Engineering and Economics), how do I get started in Data Science? How do I build up my skills and knowledge so I can embark on Data Science as a career?”

So I decided to write several posts to help individuals keep tabs on their Data Science skills and knowledge inventory.

From the macro view, I usually show the following Venn diagram to help people understand the skills and knowledge that are needed.

There are a lot of Venn diagrams out there that describe what Data Science is; here is a list of them.

The thought behind my Venn diagram is to help people understand the skills and knowledge that are needed and to guide them on becoming a data scientist. I wanted it to be as precise as possible so that readers can be more focused in their learning journey. Thus you may find it “cleaner” than other Venn diagrams you have seen.

Venn Diagram

There are three components of the Venn diagram:

1- Data & IT Management

2- Mathematical Models

3- Domain Expertise

Data & IT Management

As data scientists, we have to advise on several areas of the IT and data infrastructure: how to handle missing values, whether data can be captured at a more granular level, how to improve data quality, how to implement a scorecard in existing systems, and so on. With a good understanding of the Data & IT infrastructure, we can propose constructive suggestions on managing data and on using the models we have built. Through practical suggestions, data science can continue to add value and flourish in an organization.
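To make the missing-values point a little more concrete, here is a minimal sketch in Python of the kind of quick audit that can start such a conversation with the IT team. The function name, the field names and the customer records are all made up for illustration, and treating None or an absent key as "missing" is my assumption; your systems may encode missing values differently.

```python
from typing import Any, Dict, List, Optional


def missing_rate_by_field(records: List[Dict[str, Optional[Any]]]) -> Dict[str, float]:
    """Return, for each field, the share of records where it is None or absent."""
    total = len(records)
    if total == 0:
        return {}
    # Collect every field name that appears in at least one record.
    fields = sorted({key for record in records for key in record})
    return {
        field: sum(1 for record in records if record.get(field) is None) / total
        for field in fields
    }


# Hypothetical customer records, invented for this example.
customers = [
    {"id": 1, "age": 34, "income": 5200},
    {"id": 2, "age": None, "income": 4100},
    {"id": 3, "age": 29},                    # income was never captured
    {"id": 4, "age": 41, "income": None},
]

rates = missing_rate_by_field(customers)
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```

A field with a high missing rate is a good prompt for asking how that data is captured in the first place, which is usually more valuable than picking an imputation method right away.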

Mathematical Models

Mathematical models need no explanation. It is essential for data scientists to know and understand them. I would like to point out that there is a need to consider computational complexity as well; it is not a one-way street into “highest accuracy” ville.
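One simple way to picture the trade-off between accuracy and computational cost is sketched below in Python. The rule used here (pick the fastest model whose accuracy is within a small tolerance of the best one) is only one possible heuristic, not a standard method, and the model names, accuracy figures and latencies are hypothetical numbers I made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    accuracy: float    # validation accuracy, between 0 and 1
    latency_ms: float  # average time to score one request


def pick_model(candidates: List[Candidate], tolerance: float = 0.01) -> Candidate:
    """Pick the fastest candidate whose accuracy is within `tolerance` of the best."""
    best_accuracy = max(c.accuracy for c in candidates)
    good_enough = [c for c in candidates if c.accuracy >= best_accuracy - tolerance]
    return min(good_enough, key=lambda c: c.latency_ms)


# Hypothetical candidates, invented for this example.
candidates = [
    Candidate("logistic_regression", accuracy=0.840, latency_ms=2.0),
    Candidate("gradient_boosting", accuracy=0.860, latency_ms=15.0),
    Candidate("large_ensemble", accuracy=0.865, latency_ms=400.0),
]

chosen = pick_model(candidates, tolerance=0.01)
print(chosen.name)
```

With these made-up numbers the large ensemble has the highest accuracy, but the gradient boosting model is within the tolerance and far cheaper to run, so it is the one selected. How much accuracy you are willing to give up is a business decision, not a mathematical one.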

Just in case you are wondering, statistics is included here as well.

Domain Expertise

So what about domain expertise? Previously I labelled this circle “Business Expertise”, but as my experience accumulated, I noticed that NGOs and charities are beginning to tap into their existing data to make their donations or causes go further. Thus I decided to change it to “domain expertise” instead, to correctly reflect the current environment with regard to data science.

Generally, when we decide to build any model, data scientists should think about stakeholders’ reactions to it. For instance, if we build a model that segments students and then provides resources only to the students who are likely to succeed, this would create an uproar among students, especially those classified as “poor”. Thus we want to structure the models in a way that really meets the business or organization objectives without bringing “damage” to other aspects of the business. That requires good knowledge of how the business works, for instance its business model, processes and operations, regulations, etc.

Another example: if we are required to build a recommender system, accuracy should never be the sole consideration in selecting the best model for the task. As data scientists, we also have to determine the computational complexity of the chosen model. Here is a real-life example from Netflix (article).

Conclusion

A good data scientist never stops learning. Why is that so? If you look at the three areas in which a data scientist needs skills and knowledge, they are changing every day. In 2017 - 2018, Hadoop & Spark were mentioned a lot, and they were seen as essential skills for a data engineer or data scientist. Fast forward to 2020, who is talking about them? The infrastructure we are talking about these days is cloud computing.

In the early part of the last decade, most people knew neural networks as having a single computation layer, until AlphaGo came into the picture and everyone saw a lot of breakthroughs on the Artificial Intelligence front because of Deep Learning, a derivative of neural networks. This example shows that new machine learning algorithms are being invented all the time, and thus the job of a data scientist is to learn and understand them. By the way, we have not touched on quantum computing yet. That will be a whole new paradigm.

Moving on, if you would like to find out more about each individual area mentioned in this post, do check them out here:

“Data & IT Management”, “Mathematical Models” and “Domain Expertise”

I hope you have fun during your learning journey and I wish you all the best! :)

(Note: This post was originally written for Medium; this is an edited and updated version of it. The original post can be found here.)