Dataset structure is often overlooked by folks who are beginning their data science journey, especially those who have learned through bootcamps/courses. Due to time constraints, the datasets used there are usually cleaned and ready to be taken up by the software, i.e. students don't get much practice in data munging and may have unknowingly taken for granted that that is how it happens in the industry.
Here are a few questions to consider when structuring a dataset for analysis.
1 - For analysis purposes, at what level does the data need to be consolidated? Customer, transaction, store, or perhaps manager level?
2 - How shall I join the datasets to derive the data I need for my analysis? What is the primary key for merging? Will the primary key be unique enough to achieve the join I want?
3 - What is the definition of my "target"?
4 - How should the dataset be structured so that the software will accept it and generate the analysis or #visualization that I want to portray? Should it be Narrow (one entity, multiple rows) or Wide (one entity, one row)?
5 - What code should be written, or what steps taken, to structure the dataset into an acceptable form?
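Questions 2 and 4 can be sketched in pandas. Everything here is hypothetical, the table names, columns, and values are purely illustrative, but the pattern is the common one: verify the primary key is unique, merge the normalized tables, then reshape from narrow to wide.

```python
import pandas as pd

# Hypothetical normalized tables; names and columns are illustrative only.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["North", "South", "North"],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "month": ["Jan", "Feb", "Jan", "Jan"],
    "amount": [100.0, 150.0, 80.0, 200.0],
})

# Question 2: confirm the chosen primary key is unique on the "one" side
# before merging, so the join does not silently duplicate rows.
assert customers["customer_id"].is_unique

# Join the normalized tables; validate="many_to_one" makes pandas raise
# an error if that key assumption is ever wrong.
narrow = transactions.merge(customers, on="customer_id",
                            validate="many_to_one")

# Question 4: reshape from Narrow (one customer, multiple rows) to
# Wide (one customer, one row) for tools that expect one row per entity.
wide = narrow.pivot_table(index="customer_id", columns="month",
                          values="amount", aggfunc="sum")
print(wide)
```

The `validate` argument is worth the extra typing: a non-unique key turns an intended one-row-per-customer join into a silent row explosion, and catching that at merge time is far cheaper than debugging inflated totals later.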
In the industry, data is kept at a very granular level or normalized so as to make maintenance, such as updating records and reducing duplicates, easier. As such, it is the role of any data professional, be it data scientist or data analyst, to put the tables together in a way that is ready to be "fed" into the software for number crunching, analysis, or visualization.
This comes with practice, which goes back again to why I am a strong advocate for having a personal project portfolio. While working on your portfolio, you will get a lot of experience handling data and its possible errors, thus "taming the data dragon" before tackling the "machine learning dragon".
I've just started a contribution page at buymeacoffee.com. If I have been adding value to you, please consider making a donation to let me know I have made an impact. :)