Data & IT Knowledge needed to be a Data Scientist

In my last two posts, I gave an overall view of the skills and knowledge needed by a data scientist and discussed the mathematics that a data scientist should know. Continuing, I shall now touch on the data and IT management knowledge that a data scientist should have.

1 - Data Governance & Management

As data is the lifeblood of data science, it needs to be carefully managed like any other strategic asset in a company. Thus it is important that the data scientist has knowledge of data governance and management, to ensure that data is of the highest quality possible. A chef wants the freshest ingredients possible so as to cook the tastiest dishes. Similarly, the data scientist wants the best quality data possible, so that the insights have business value.

Wikipedia defines data governance as a set of processes that formally manage data so that the data can be trusted and, together with accountability, adverse data events such as missing data, poor data quality and data leakage are kept to a minimum.

The data scientist needs to be mindful of the data governance processes, and should even give feedback on them, to ensure that data is of the highest quality and, with similar priority, to ensure data privacy and security, so that only those who need to access the data can access it.

Here is a flavor of what data governance is. Take, for instance, access to data by a new employee. Data governance will spell out the process (forms to fill and approvals to seek) that needs to be followed to give access rights to the new employee: what information is required to decide on granting the access rights, what kind of access rights to grant, how long the access is required, who has the authority to grant the access, who effects the decision after it is made, and so on. This is, of course, just the tip of the iceberg called “data governance”.
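To make the access-rights example a little more concrete, here is a minimal sketch in Python of how such a request could be recorded. Everything in it is my own assumption for illustration (the AccessRequest class, its field names, the approve function and the sample names); a real governance process will have its own forms and systems.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class AccessRequest:
    """Hypothetical record of one data access request."""
    requester: str
    dataset: str
    access_level: str          # what kind of access, e.g. "read" or "write"
    start: date
    duration_days: int         # how long the access is required
    justification: str         # information needed to make the decision
    approved_by: Optional[str] = None

    @property
    def expires(self) -> date:
        # Access ends duration_days after the start date.
        return self.start + timedelta(days=self.duration_days)

    def is_active(self, today: date) -> bool:
        # Access counts only if it was approved and has not yet expired.
        return self.approved_by is not None and self.start <= today < self.expires


def approve(request: AccessRequest, approver: str, authorised: set) -> AccessRequest:
    """Record the approval, but only if the approver has the authority."""
    if approver not in authorised:
        raise PermissionError(f"{approver} cannot approve access to {request.dataset}")
    request.approved_by = approver
    return request


request = AccessRequest(
    requester="new.employee",
    dataset="customer_transactions",
    access_level="read",
    start=date(2024, 1, 1),
    duration_days=90,
    justification="Needed for churn analysis project",
)
approve(request, approver="data.owner", authorised={"data.owner"})
```

The point of the sketch is only that each question in the process (what, how long, who decides) becomes an explicit field or check that can be audited later.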

Along the way, as the organization progresses up the learning curve, the data scientist will also need to propose a data collection strategy, the right level of data granularity, the technology and infrastructure to support the strategy, and the associated data governance processes to better manage the newly collected data.

Data Management & Data Quality

The importance of maintaining high quality data can never be stressed enough. However, data quality can be very vague to the layman. To better understand what data quality is, we can measure it along certain dimensions. Below are the dimensions:

Accuracy: To what extent does the data reflect reality?
Completeness: Have we gotten all the possible data?
Timeliness: Is the data available when I need it?
Validity: Does it conform to the defined format?
Consistency: Is the format the same across different tables?
Uniqueness: Are there no duplicates?

Along the learning journey, the data scientist is bound to come across more dimensions of data quality and can build up a personal list along the way. More importantly, the data scientist can and should give feedback on the kinds of metrics used to measure data quality. Besides the metrics themselves, setting the right alert level (for instance, sounding an alert when the percentage of missing values crosses 1.0) is important to ensure that “dirty” data does not seep in.
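As a rough illustration of turning some of these dimensions into metrics with alert levels, here is a minimal Python sketch. The sample records, the function names, the email pattern and the thresholds are all assumptions I made up for the example; the right metrics and alert levels depend on the organization's own data.

```python
import re

# Hypothetical sample records; None marks a missing value.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "not-an-email", "age": None},
    {"id": 2, "email": "b@example.com", "age": 51},
    {"id": 4, "email": None, "age": 29},
]


def pct_missing(rows, field):
    """Completeness: percentage of rows where the field is missing."""
    missing = sum(1 for r in rows if r.get(field) is None)
    return 100.0 * missing / len(rows)


def pct_duplicates(rows, key):
    """Uniqueness: percentage of rows whose key repeats an earlier row."""
    seen, dupes = set(), 0
    for r in rows:
        if r[key] in seen:
            dupes += 1
        seen.add(r[key])
    return 100.0 * dupes / len(rows)


def pct_invalid(rows, field, pattern):
    """Validity: percentage of non-missing values that fail the format check."""
    values = [r[field] for r in rows if r.get(field) is not None]
    if not values:
        return 0.0
    bad = sum(1 for v in values if not re.fullmatch(pattern, v))
    return 100.0 * bad / len(values)


def check_alerts(metrics, thresholds):
    """Return the names of metrics whose value crosses the alert level."""
    return [name for name, value in metrics.items() if value > thresholds[name]]


# A deliberately simple format check, not a full email validator.
EMAIL_PATTERN = r"[^@\s]+@[^@\s]+\.[^@\s]+"

metrics = {
    "missing_age_pct": pct_missing(records, "age"),
    "duplicate_id_pct": pct_duplicates(records, "id"),
    "invalid_email_pct": pct_invalid(records, "email", EMAIL_PATTERN),
}
# Assumed alert levels, in percent.
thresholds = {"missing_age_pct": 1.0, "duplicate_id_pct": 0.0, "invalid_email_pct": 5.0}
alerts = check_alerts(metrics, thresholds)
```

Accuracy and timeliness are harder to check this way because they need an external reference (the real-world value, or the time the data was needed), so they are left out of the sketch.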

Data quality is just one aspect of data management. Other aspects include, for instance, automation, validation, ETL processes, back-up and access rights.

Enterprise Data Warehouse (EDW)

The data scientist need not build the whole enterprise data warehouse, but it is good to have some idea of how to structure it, because the structure can affect data quality, especially the timeliness and uniqueness aspects. When the EDW is being built, the data scientist can add value by bearing in mind the need for business continuity in extreme times and the ETL (Extract, Transform & Load) processes.

2 - Computer Architecture

In my university days, I took an “Introduction to Computers” module, and the knowledge gained has helped me tremendously in my work, in understanding how computers work and how computing takes place inside them. Things like the memory bus, cache, memory, hard disk and CPU were taught in that module. It gave me a lot of appreciation for the hardware side of computing and how it can limit or enhance processing. The knowledge also serves as a good foundation for understanding how technology works behind the scenes. So I must say that a good understanding of computer architecture gives the data scientist the ability to propose feasible solutions for capturing and maintaining data and for implementing models and algorithms in IT systems.

Enterprise IT Architecture Design

Besides understanding how computers work, it would be great for the data scientist to have some appreciation of how IT architecture is planned, designed and built in the organization.

The reason is that the data scientist may need to implement the final, chosen models in the enterprise IT architecture. A good understanding of that architecture helps the data scientist propose feasible ideas for data collection strategy or model implementation. It also helps the data scientist, while training models, to take note of the constraints and possibilities of the IT architecture when it comes to embedding these models in it (i.e. good integration of the model into the existing IT systems and business processes, ensuring a smooth flow of data and numbers). For instance, some legacy systems cannot handle composite variables (e.g. X1X2), so the data scientist cannot build a machine learning model that has composite variables and can only use simple features (X1 and/or X2).
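To illustrate the composite variable constraint, here is a small Python sketch. I am assuming that X1X2 means the product of X1 and X2, and the weights and the scoring function are made up for the example; they do not come from any real model or system.

```python
def add_composite(row):
    """Return a copy of the row with the composite feature X1X2 = X1 * X2."""
    out = dict(row)
    out["X1X2"] = row["X1"] * row["X2"]
    return out


def score(row, weights, intercept=0.0):
    """Linear score: intercept plus the weighted sum of the named features."""
    return intercept + sum(w * row[name] for name, w in weights.items())


# A record as a legacy system might supply it: raw fields only.
row = {"X1": 2.0, "X2": 3.0}

# Model A (hypothetical weights) needs X1X2 to be computed at scoring time.
weights_with_composite = {"X1": 0.5, "X2": -0.25, "X1X2": 0.1}
# Model B (hypothetical weights) uses only the simple features.
weights_simple = {"X1": 0.5, "X2": -0.25}

score_a = score(add_composite(row), weights_with_composite)
score_b = score(row, weights_simple)
```

If the production system cannot run a step like add_composite before scoring, Model A cannot be deployed there: calling score(row, weights_with_composite) on the raw row fails with a KeyError because X1X2 is missing. Knowing this before training saves the effort of building a model that cannot be embedded.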

3 - Programming Languages

Given that most data scientists need to tap on computers’ immense computation capabilities, the data scientist cannot escape from coding unless he or she wants to restrict career opportunities to corporate environments that use a lot of point-and-click (i.e. GUI) software. Moreover, being proficient in coding gives the data scientist more flexibility in setting the models’ hyper-parameters and loss functions, allowing more “creativity” in the models built.

I can understand why people can be averse to programming, because it is like trying to talk to foreigners in their native language, with which we have minimal familiarity, and hoping they get the full picture of what we are trying to convey. My first programming language was Java and I sucked at it when I took a module on it in my undergrad days. As I moved on in my career, I realized that I would be limiting my opportunities if I did not deal with programming, and decided to pick it up again. Thank goodness, the next ‘programming’ language that I picked up was SAS and, because of the controlled environment, it was a more pleasant experience. It gave me the confidence to go further and move on to R and Python (open source).

Learning programming has gotten easier, given that many IDEs help (through suggestions and color coding) and that a tremendous amount of resources is available, such as YouTube, OpenCourseWare, blogs and StackOverflow. One can pick up programming on one’s own or, if structure is needed, attend programming bootcamps, which are still the rage these days.

If I may propose another angle on learning programming, I find that programming is like solving a logic puzzle. Fixing bugs requires one to think logically and have a good understanding of how the language works behind the scenes. So being able to resolve a bug (although the process is supremely and absolutely frustrating) actually does provide a sense of achievement. To me at least, it feels like struggling with a puzzle for a long while: the minute the solution becomes obvious, a barrel of feel-fantastic opens up.

Moreover, there are many ways to reach the objective through programming. Being proficient enough to write code that runs efficiently gives one a sense of achievement too. For anyone starting on data science, I would strongly encourage exploring programming. The languages that data scientists commonly use these days are R and Python, and in recent years Python seems to be leading R.

For a data scientist, I believe in getting the concepts correct first and figuring out the tools to use later, because with the right concepts, one has the “right” training wheels fitted to pick up the language efficiently. All these articles about which tool is better are mere clickbait, and the only people gaining from them are the writers of the articles (paid in the most expensive currency, called Time).

4 - Software and Hardware Tools

As data scientists, we have to update ourselves on the latest tools available, both enterprise and open-source. This is because organizations rely on the data scientist to propose the right tools for each project.

To give an analogy, say you are given a toolbox with all kinds of tools inside. Looking at the challenge at hand, you would only use tools (hammer, pliers, screwdrivers) that you are familiar with and know to be effective in resolving the challenge. The more tools we are familiar with, the higher the chances of solving the challenge at hand.

Similarly, the data scientist should try to be familiar with all kinds of tools, both software and hardware, and have some idea of how each works, its pros and cons, the kinds of situations in which it is effective, its maintenance costs, possible integration issues with the current set of tools and so on (sort of a systems analyst role). Having a good idea of the tools out there lets the data scientist add value that is important for an organization looking to tap its data for more insights.

The data scientist need not read in depth (down to documentation level) to be able to propose a tool, but should understand at the most fundamental level how it works and its pros and cons. If there is time and there are resources to experiment with it, then please do so as well, to get a good idea of the implementation challenges.

Some suggested tools that data scientists should be familiar with are visualization and dashboarding tools, machine learning tools, data warehousing and process tools, and computation tools.

Conclusion

For data science to work, the data scientist has to take great care of the data, ensuring that it is trustworthy and of a quality from which meaningful insights can be drawn. This can be achieved if the data scientist also participates in building and maintaining the IT infrastructure. Being able to propose feasible solutions helps build up the credibility of the data scientist and makes the data science team a close-knit team with high internal trust, something that is important given that data science requires team effort in organizations.

This is the Data & IT management knowledge that I think a data scientist should know at a minimum. In the next blog post, I will discuss the domain expertise that a data scientist needs to be effective.

Have fun in your data science learning journey and do visit my other blog posts and LinkedIn profile for data science nuggets.

(Note: This is an edited and updated version of a post previously written for Medium. The original post can be found here.)

Becoming a Data Scientist - Part 3: Data & IT Management

Koo Ping Shung
