Becoming a Data Scientist - Part 1
A question I get at almost every meetup, seminar or training session is, “Given my XXX background (most commonly Computer Science, Statistics, Engineering or Economics), how do I get started in Data Science? How do I build up my skills and knowledge so I can pursue Data Science as a career?”
So I decided to write several posts here to help individuals keep tabs on their Data Science skills and knowledge inventory.
From the macro view, I usually show the following Venn diagram to explain the skills and knowledge that are needed.
There are a lot of Venn diagrams out there that describe what Data Science is; here is a list of them.
The thinking behind my Venn diagram is to help people understand the skills and knowledge that are needed and to guide them towards becoming a data scientist. I wanted to be as precise as possible so that readers can be more focused in their learning journey. Thus you may find it “cleaner” than other Venn diagrams that you have seen.
Venn Diagram
There are three components of the Venn diagram:
1- Data & IT Management
2- Mathematical Models
3- Domain Expertise
Data & IT Management
As data scientists, we have to advise on several areas of the IT and data infrastructure: how to handle missing values, whether data can be captured at a more granular level, how to improve data quality, how to implement a scorecard in existing systems, and so on. With a good understanding of the data and IT infrastructure, we can then propose constructive suggestions on managing data and using the models that we have built. Through practical suggestions, data science can continue to add value and flourish in an organization.
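To make the missing-values point a little more concrete, here is a minimal sketch of the kind of check that can back up a data quality suggestion. The sample records, the column names and the 30% threshold are all made up for illustration, and it uses only the Python standard library; in practice you would likely run something similar against your own tables.

```python
from typing import Any, Dict, List

# Hypothetical sample records; None marks a value that was never captured.
SAMPLE_ROWS: List[Dict[str, Any]] = [
    {"customer_id": 1, "age": 34, "income": 52000},
    {"customer_id": 2, "age": None, "income": 61000},
    {"customer_id": 3, "age": 45, "income": None},
    {"customer_id": 4, "age": None, "income": 48000},
]


def missing_value_report(rows: List[Dict[str, Any]]) -> Dict[str, float]:
    """Return the fraction of missing (None or absent) values per column."""
    if not rows:
        return {}
    # Collect every column name that appears in at least one record.
    columns = sorted({key for row in rows for key in row})
    total = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


def columns_above_threshold(report: Dict[str, float], threshold: float) -> List[str]:
    """List the columns whose missing fraction exceeds the threshold."""
    return [col for col, fraction in report.items() if fraction > threshold]


if __name__ == "__main__":
    report = missing_value_report(SAMPLE_ROWS)
    for column, fraction in report.items():
        print(f"{column}: {fraction:.0%} missing")
    # Assumed cut-off: flag anything with more than 30% missing values.
    print("Needs attention:", columns_above_threshold(report, 0.30))
```

A report like this does not fix anything by itself, but it gives you something specific to bring to the team that owns the data capture process.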
Mathematical Models
Mathematical models need little explanation. It is essential for a data scientist to know and understand them. I would like to point out that computational complexity needs to be considered too; it is not a one-way street into “highest accuracy” ville.
Just in case you are wondering, statistics is included here as well.
Domain Expertise
So what about domain expertise? Well, previously I labelled this circle “Business Expertise”, but as my experience accumulated, I noticed that NGOs and charities are beginning to tap into their existing data to make their donations or causes go further. Thus I decided to change it to “domain expertise” instead, to correctly reflect the current data science environment.
Generally, when we decide to build a model, data scientists should think about stakeholders’ reactions to it. For instance, if we build a model that segments students and provides resources only to the students who are likely to succeed, this would create an uproar among students, especially those classified as “poor”. Thus we want to structure the business/organization objectives and models in a way that really meets the objectives without bringing “damage” to other aspects of the business. That requires good knowledge of how the business works, for instance its business model, processes & operations, regulations, etc.
Another example: if we are required to build a recommender system, accuracy should never be the sole consideration in selecting the best model for the task. As data scientists, we also have to determine the computational complexity of the chosen model. Here is a real-life example from Netflix (article).
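One simple way to picture this trade-off is to treat the computation cost as a hard budget and only then pick the most accurate model that fits it. The sketch below is just that, a sketch: the candidate names, accuracy figures and latency numbers are invented for illustration and are not taken from Netflix or any real system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    accuracy: float    # offline accuracy on a hold-out set, between 0 and 1
    latency_ms: float  # average time to score one request, in milliseconds


# Hypothetical candidates; every number here is made up.
CANDIDATES: List[Candidate] = [
    Candidate("large_ensemble", accuracy=0.91, latency_ms=850.0),
    Candidate("matrix_factorization", accuracy=0.88, latency_ms=40.0),
    Candidate("popularity_baseline", accuracy=0.75, latency_ms=2.0),
]


def pick_model(candidates: List[Candidate], max_latency_ms: float) -> Optional[Candidate]:
    """Return the most accurate candidate that fits the latency budget, if any."""
    feasible = [c for c in candidates if c.latency_ms <= max_latency_ms]
    if not feasible:
        return None  # nothing fits; the budget or the candidates must change
    return max(feasible, key=lambda c: c.accuracy)


if __name__ == "__main__":
    # Assumed budget of 100 ms per request.
    chosen = pick_model(CANDIDATES, max_latency_ms=100.0)
    print(chosen.name if chosen else "no model fits the budget")
```

With a budget of 100 ms, the most accurate candidate is ruled out and the second best wins, which is exactly the situation where domain knowledge about what the system can afford matters more than the leaderboard.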
Conclusion
A good data scientist never stops learning. Why is that so? If you look at the three areas in which a data scientist needs skills and knowledge, they are changing every day. In 2017 - 2018, Hadoop & Spark were mentioned a lot, and they were essential skills for a data engineer or data scientist. Fast forward to 2020, who is talking about them? The infrastructure we are talking about these days is cloud computing.
In the early part of the last decade, most people knew neural networks as having a single computation layer, until AlphaGo came into the picture and everyone saw a lot of breakthroughs on the Artificial Intelligence front because of Deep Learning, a derivative of neural networks. This example shows that new machine learning algorithms are being invented, and thus the job of a data scientist is to learn and understand them. By the way, we have not touched on quantum computing yet. That will be a whole new paradigm.
Moving on, if you are interested in finding out more about each individual area mentioned in this post, do check them out here:
“Data & IT Management”, “Mathematical Models” and “Domain Expertise”
I hope you have fun during your learning journey and I wish you all the best! :)
(Note: This post was originally written for Medium; this is an edited and updated version. The original post can be found here.)