In a previous post, I shared tips on how to prepare your Data Science resume and project portfolio (here is the post for a recap). In this post, I will give some suggestions on how to start working on your first project.
Step 1: Learn the Concepts
Start by learning the concepts. Try to understand what machine learning is, how the common models work, and the mathematics behind them. You MUST know the mathematics behind them. I have a post that you can reference to find out which branches of mathematics you need to learn.
Step 2: Domain of Interest
Ask yourself which domain you would be interested to work in: for instance, marketing, operations, or logistics. Why is this important? If you are not passionate about the topic at hand, chances are you will give up on the project halfway through when things get tough. Having said that, my suggestion is to keep a list of domains and rank them by your interest. The reason is that having an interesting domain is one thing, but finding data for that domain may be challenging. For instance, you might be interested in Fraud Detection, but for security reasons fraud data is seldom shared online.
Step 3: Finding Data
There are lots of open datasets online. You just need to look hard enough. Here are some places to look:
1) UC Irvine Machine Learning Repository
2) Kaggle
3) Open City Datasets from the United States (Boston, San Francisco, New York, Chicago)
Image data, text data, and network data can be found easily too. If you cannot find what you need, this might be a good chance to learn web scraping, but please check the user license agreement of each website first, as not all websites allow web scraping.
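If you do go down the web scraping route, one machine-readable signal you can check is a site's robots.txt file. The sketch below is only an illustration: the robots.txt content, the bot name `my-project-bot`, and the example.com URLs are all made up, and robots.txt is not a substitute for reading the website's license agreement or terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be downloaded from the site.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

def build_parser(robots_text):
    """Build a robots.txt parser from text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# Hypothetical bot name and URLs, used only for illustration.
allowed_public = parser.can_fetch("my-project-bot", "https://example.com/data/listing.html")
allowed_private = parser.can_fetch("my-project-bot", "https://example.com/private/report.html")

print("public page allowed:", allowed_public)
print("private page allowed:", allowed_private)
```

Even when robots.txt allows a page, the license agreement may still restrict how you can use the data, so treat this check as a first filter only.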
Step 4: Start the Project
First, ask yourself what business challenge you would like to solve using the data that you found. Ask yourself how you are going to solve the business challenge and whether there is a better way to solve it. Then plan the steps in the project.
In parallel, start picking up the tool you are going to use for the project, usually Python or R.
Start exploring and analyzing your data for raw insights. Analysis skills are very important to a data scientist, and this is an opportunity to sharpen them. Here are a few tips to explore the data, non-visually and visually.
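To make the non-visual side concrete, here is a minimal sketch of a first pass over a dataset: counting rows, counting missing values, and computing a few summary statistics. The CSV content and its column names (`order_id`, `region`, `amount`) are hypothetical, and I stick to Python's standard library so the sketch runs anywhere; in a real project you would more likely reach for a data analysis library.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical sample data standing in for a downloaded CSV file.
RAW_CSV = """order_id,region,amount
1,North,120.5
2,South,80.0
3,North,
4,East,310.0
5,South,95.5
"""

def summarize(csv_text):
    """Return row count, missing-value count, basic stats, and category counts."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Keep only the amounts that are actually present.
    amounts = [float(r["amount"]) for r in rows if r["amount"]]
    return {
        "n_rows": len(rows),
        "missing_amount": sum(1 for r in rows if not r["amount"]),
        "mean_amount": statistics.mean(amounts),
        "median_amount": statistics.median(amounts),
        "region_counts": Counter(r["region"] for r in rows),
    }

summary = summarize(RAW_CSV)
for key, value in summary.items():
    print(key, value)
```

A gap between the mean and the median, as in this toy data, is a hint that a few large values are pulling the average up, which is the kind of raw insight worth writing down early.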
You will be using the tool to clean and query the dataset. For a list of data cleaning tasks you should learn for your project, check out my post here.
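As a rough illustration of what cleaning can involve, the sketch below trims whitespace, normalizes capitalization, drops rows with a missing value, and removes exact duplicates. The records and field names are hypothetical, and the choices here are assumptions: dropping incomplete rows is only one option (filling in missing values is another), and treating identical rows as duplicates is not always correct for your data.

```python
def clean_records(records):
    """Trim whitespace, normalize case, drop rows missing an amount, remove duplicates."""
    cleaned = []
    seen = set()
    for rec in records:
        region = rec["region"].strip().title()
        amount_text = rec["amount"].strip()
        if not amount_text:
            # Assumption: rows without an amount are dropped rather than imputed.
            continue
        row = (region, float(amount_text))
        if row in seen:
            # Assumption: an identical (region, amount) pair is a duplicate entry.
            continue
        seen.add(row)
        cleaned.append({"region": region, "amount": row[1]})
    return cleaned

# Hypothetical messy records, as they might come out of a raw file.
raw = [
    {"region": " north ", "amount": "120.5"},
    {"region": "North", "amount": "120.5 "},
    {"region": "south", "amount": ""},
    {"region": "EAST", "amount": "310"},
]

cleaned = clean_records(raw)
print(cleaned)
```

Whatever rules you settle on, record them in your project notes so that a reader can see why rows were changed or removed.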
After the data is cleaned to a satisfactory level, it is time to start the machine learning phase and train numerous models. Here are a few posts that are useful.
- "Complex" Machine Learning Models is ALWAYS Better?
- Model Selection: Accuracy, Precision, Recall, or F1?
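To see why the choice of metric matters when you compare models, here is a small sketch that computes accuracy, precision, recall, and F1 from scratch. The labels and the two sets of predictions are made up for illustration, and returning 0.0 when a metric is undefined (for example, precision when nothing is predicted positive) is a convention I am assuming here.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Assumed convention: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical imbalanced labels: only 2 of 10 cases are positive.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # a model that never flags anything
some_detection = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]    # a model that catches one positive

metrics_a = classification_metrics(y_true, always_negative)
metrics_b = classification_metrics(y_true, some_detection)
print(metrics_a)
print(metrics_b)
```

Both toy models reach the same accuracy, yet only the second one finds any positive case, which shows up in precision, recall, and F1. On imbalanced data such as fraud, accuracy alone can hide exactly this difference.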
Remember to document your thought process through GitHub and blog posts. Also write down your analysis and the insights that you derived from your final selected and trained machine learning model.
Step 5: Propose Strategy
Look at your insights. Propose and devise strategies to take advantage of them. Make reasonable assumptions if needed and document these strategies. Why? At the end of the day, these strategies are the final step to realize value from data. If your strategy cannot exploit the insights from the trained model, all your previous steps are as good as wasted. Do some research, then document the strategy that you propose. Show your potential employer that you know how to get value from data.
Finally, try to explain your insights to your friends and to people who are not in your field. Let them ask questions about your analysis. The more questions the better, because answering them gives you an opportunity to sharpen your communication skills.
Conclusion
At the end of the day, remember that you need to demonstrate that you can solve business challenges with data, and to display that capability throughout your project. I hope the steps and associated posts help you come up with your first (and many more) data science projects.