Using Excel is not Data Science?

"If you use Excel, it is NOT data science." Surprisingly, I have heard comments like this many times. So in this post I am going to discuss the claim and share my viewpoint.

If you have followed my posts or podcast, you will know that my definition of data science is about getting value from data and helping businesses solve challenges through data. It sounds simple, but once you see the amount of knowledge, skills, and tools you have to go through, it is guaranteed to get overwhelming.

If you look at my definition, tools are a means to an end. Tools should never be used as a gauge to determine whether one is doing data science or not. That is putting the cart before the horse. Wrong focus!

The discussion on tools should focus on which tools are useful in which circumstances. I believe every tool has its pros and cons, and as data scientists we need to understand them and apply the tools according to the project we are working on. For instance, Excel can calculate and generate decision trees if one sets the formulas up correctly, but there are more efficient tools, like R or Python, that do the same with a few lines of code.
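To make the "few lines of code" point concrete, here is a minimal sketch of fitting a decision tree in Python. It assumes the scikit-learn library is installed and uses its bundled iris dataset purely as stand-in data; the depth limit and the train/test split are illustrative choices, not recommendations.

```python
# Assumption: scikit-learn is installed (pip install scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data: the iris dataset that ships with scikit-learn.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# Fit a shallow tree so the resulting rules stay readable.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Accuracy on the held-out rows, plus the learned rules as plain text.
accuracy = tree.score(X_test, y_test)
rules = export_text(tree, feature_names=list(iris.feature_names))
print(f"Held-out accuracy: {accuracy:.2f}")
print(rules)
```

The printed rules read much like the nested IF formulas one would otherwise build by hand in a spreadsheet, which is the comparison being made here.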

Excel is Dead for Data Science?

Excel is not going away, in my opinion. It is a ubiquitous tool: nearly every company has it. Finding someone proficient in it is easy as well, which matters because a tool is useless without someone who can use it. These are the reasons why any company that is starting on its data journey should start with Excel. Excel allows users to accomplish most data cleaning tasks easily, build simple visualizations, and, last but not least, access common models like multivariate linear regression and operations research models (through the Solver). Using Excel allows a company to quickly extract value from its data at a LOW COST.

Having said that, as companies mature in their use of data, their needs are bound to change. As their data grows in size, it makes more sense to use a database, which also makes access rights easier to manage. When setting up ETL (extract, transform, and load) processes, a scripting language like Python will be a better fit. When there is a need for other machine learning models, such as support vector machines, gradient boosting, or deep learning, for possibly better insights, Python and R might come in handy.
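As a rough sketch of what a scripted ETL step into a database can look like, the example below uses only Python's standard library (csv and sqlite3). The sample data, the orders table, and the extract/transform/load helper names are all hypothetical, made up for illustration; a real pipeline would read an actual exported file and write to the company's own database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would be a CSV file saved from Excel.
RAW_CSV = """order_id,customer,amount
1, Alice ,120.50
2,Bob,
3,alice,79.50
"""

def extract(text):
    """Extract: parse the CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with no amount and normalise customer names."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        customer = row["customer"].strip().title()
        cleaned.append((int(row["order_id"]), customer, float(amount)))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

# An in-memory SQLite database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Once the steps live in a script, the same cleaning rules run identically every time new data arrives, which is harder to guarantee when the cleaning is done by hand in a spreadsheet.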

Even then, none of this will eliminate Excel, because it remains a great tool for fast prototyping and testing of ideas, given its ease of use and the familiarity that a larger group of people have with it.

In conclusion, my opinion is that it is wrong to judge whether someone is doing data science by the tools that they use. :)

If you find this post useful, please feel free to share it. Any feedback can be sent through LinkedIn or Twitter, and do consider subscribing to my newsletter to stay in touch with my thoughts and what I am learning. :)