Written by Vectice

The Data Science Lifecycle

Table of Contents

  • Understanding the business case
  • Data mining
  • Data cleaning
  • Data exploration
  • Feature engineering
  • Predictive models
  • Visualize and understand

Data science has become an essential tool for business forecasting and management decisions. It allows you to study and predict the behavior of consumers and markets. Today, we’ll explore the data science lifecycle and how you can apply it in your organization.

The purpose of data science is to find hidden patterns in large data sets with statistical analysis. These patterns can uncover “secret knowledge” that allows you to build models, make predictions, and ultimately get ahead of your competitors!

1. Understanding the Business Case

First, you need to ask yourself what you’re trying to solve. Are you analyzing customers’ behavior? Optimizing marketing content? Expanding sales channels?

Let’s say you are trying to predict holiday sales for your company. Will you forecast both online and retail sales? Will you segment the buyer population? Are you forecasting sales for all product lines? What timeframe do you consider “the holiday period”?

Make sure all parties involved agree on the goals and KPIs of the project. Only when these are clearly defined can you start searching for the data the job requires, a process known as data mining.

2. Data Mining

Data mining is the process of collecting and storing data relevant to your project. Ideally, you can use internal data that your company has been collecting over time; better still if that data is already structured and segmented in some way. This illustrates the importance of a streamlined, structured data pipeline: it lets you quickly gather data for research projects.

As Jonathan Johnson notes, “the data pipeline should seamlessly get data to where it is going and allow the flow of business to run smoothly”. An ideal pipeline should encompass every stage of the data journey. For more useful tips, check out his entire article on the data pipeline.

If you don’t have access to internal data, you will need to consult external data sources or start collecting your own data. External sources will be cheaper and faster, but they might not contain the specific data you need. Collecting data is more time-consuming, but it might be worthwhile to obtain the right data.

For the holiday sales example, we can look at our company's prior sales data, supplier delivery times, traffic to our webshop, demand for our products and for competitors' similar products, economic indicators, and so forth.
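To make this concrete, here is a minimal sketch of pulling two such sources together with pandas. The figures, column names, and the consumer-confidence indicator are all invented for illustration:

```python
import pandas as pd

# Hypothetical internal sales records pulled from the company warehouse
sales = pd.DataFrame({
    "month": ["2023-11", "2023-12", "2024-11", "2024-12"],
    "units_sold": [1200, 1850, 1340, 2010],
})

# Hypothetical external signal, e.g. a consumer-confidence index
confidence = pd.DataFrame({
    "month": ["2023-11", "2023-12", "2024-11", "2024-12"],
    "confidence_index": [98.2, 101.5, 99.7, 103.1],
})

# Join the two sources on the shared key so every downstream
# stage of the lifecycle works from a single table
dataset = sales.merge(confidence, on="month", how="left")
```

In practice the internal table would come from your warehouse and the external one from a vendor or public API, but the joining step looks the same.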

3. Data Cleaning

After collecting the data set, it will need to be cleaned and prepared for analysis. Data will almost always contain erroneous or unusable values. Think of spelling mistakes, duplicate values, empty cells, or data irrelevant to the project. Removing these with the help of data science techniques will improve the results of your analysis and predictions.
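A small, hypothetical example of this kind of cleanup with pandas, covering duplicates, empty cells, and a spelling mistake (the data is made up):

```python
import numpy as np
import pandas as pd

# Toy raw data with the usual problems: a duplicate row (after
# fixing case), an empty cell, a missing value, and a typo
raw = pd.DataFrame({
    "channel": ["online", "Online", "retail", "retial", None, "online"],
    "revenue": [120.0, 120.0, 90.0, 85.0, 50.0, np.nan],
})

clean = (
    raw
    .dropna()                                # drop rows with empty cells
    .assign(channel=lambda d: d["channel"]
            .str.lower()                     # normalize casing
            .replace({"retial": "retail"}))  # fix a known spelling mistake
    .drop_duplicates()                       # remove exact duplicate rows
    .reset_index(drop=True)
)
```

Six messy rows become three usable ones; on real data the same handful of operations does a surprising amount of the work.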

Note: these first three stages can easily take up 80% of the project's time, because collecting, cleaning, and organizing data is painstaking work. Keep this in mind when planning your roadmap.

Source: CrowdFlower Data Science Report

4. Data Exploration

Once we've collected and cleaned our data, it’s finally time to start exploring the data set. We can apply different statistical techniques to get a “feel” for the data. For example, we can plot sales over the last 5 years and see how they evolved, compare online vs. retail sales in a pie chart, or measure the impact of discounts on our sales with a correlation matrix.
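The correlation idea above can be sketched in a few lines; the revenue and discount numbers here are invented purely for illustration:

```python
import pandas as pd

# Hypothetical yearly figures (revenue in $M)
sales = pd.DataFrame({
    "year": [2020, 2021, 2022, 2023, 2024],
    "revenue": [1.1, 1.3, 1.2, 1.6, 1.9],
    "avg_discount": [0.05, 0.08, 0.06, 0.12, 0.15],
})

# Year-over-year growth gives a quick feel for the trend
sales["yoy_growth"] = sales["revenue"].pct_change()

# Correlation matrix: do discounts move together with revenue?
corr = sales[["revenue", "avg_discount"]].corr()
```

With these toy numbers the correlation comes out strongly positive, which is exactly the kind of signal you would then probe further in the modeling stage (correlation alone does not tell you the discounts *caused* the revenue).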

5. Feature Engineering

When designing a model, we will need to choose which features to extract from the data. These are the inputs we think are the most relevant for predicting our sales. As Emre Rençberoğlu states: “all machine learning algorithms essentially use some input data to create outputs”. Feature engineering serves two purposes: preparing the data set to be compatible with our model and improving the performance of our model. The better our inputs are defined, the more accurate our model outputs will be.
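As an illustration, here is one way such features might be derived with pandas. The column names, the orders themselves, and the definition of the “holiday period” as November–December are all assumptions made for this sketch:

```python
import pandas as pd

# Hypothetical order records
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-11-29", "2024-12-24", "2024-07-01"]),
    "channel": ["online", "retail", "online"],
    "amount": [250.0, 80.0, 40.0],
})

features = (
    orders
    .assign(
        # Month number as a crude seasonality signal
        month=lambda d: d["order_date"].dt.month,
        # Flag the assumed holiday window (Nov-Dec)
        is_holiday_period=lambda d: d["order_date"].dt.month
                                      .isin([11, 12]).astype(int),
    )
    # One-hot encode the categorical channel column so a numeric
    # model can consume it
    .pipe(pd.get_dummies, columns=["channel"])
)
```

Both purposes from the paragraph above show up here: one-hot encoding makes the data *compatible* with a numeric model, while the holiday flag injects domain knowledge to *improve* it.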

For a more comprehensive view of feature engineering, visit Emre’s blog post.

6. Predictive Models

Once we’ve established our features, we will use them to build models. This can be done with several data science techniques, including classification, regression, and clustering. One of the most common techniques is linear regression. Say we want to investigate the relationship between past sales and marketing expenditures. We can use linear regression to project that relationship into the future for a reasonable estimate of what our sales might be.
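A toy version of that sales-vs-marketing regression with scikit-learn; the spend and sales figures are made up for the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical history: marketing spend (in $k) vs. sales (in $k)
spend = np.array([[10.0], [20.0], [30.0], [40.0]])
sales = np.array([120.0, 180.0, 250.0, 310.0])

# Fit an ordinary least-squares line through the history
model = LinearRegression().fit(spend, sales)

# Project sales for a planned spend of $50k
forecast = model.predict(np.array([[50.0]]))[0]
print(round(forecast, 1))  # → 375.0
```

Real forecasts would use many features at once (multiple linear regression) and a train/test split to check the fit, but the projection step looks just like this.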

For a high-level overview, we recommend Jason Wong’s wonderful blog post on linear regression. It covers the concepts “under the hood” of regression, simple and multiple linear regression, and how to test your results for biases.

7. Visualize And Understand

The final step of the data science lifecycle is to analyze your results and decide how to act on these newfound insights. If you’re confident that your model is highly accurate, you can start planning for the holiday season to maximize profits. You will likely also need to communicate your results to stakeholders, and the best way to do that is to visualize them with clear, intuitive graphs.
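For example, a simple bar chart of a (hypothetical) holiday forecast with matplotlib:

```python
import matplotlib
matplotlib.use("Agg")          # render without a display, e.g. on a server
import matplotlib.pyplot as plt

# Made-up forecast figures, in $k
months = ["Oct", "Nov", "Dec"]
forecast = [210, 340, 520]

fig, ax = plt.subplots()
ax.bar(months, forecast, color="steelblue")
ax.set_title("Forecasted Holiday Sales")
ax.set_ylabel("Sales ($k)")
fig.savefig("holiday_forecast.png")
```

A chart like this communicates the headline result to stakeholders far faster than a table of model coefficients would.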

If you think something is not right, you can restart the process and see where things went wrong. Perhaps the data contains too many erroneous values. Maybe the features need to be more fine-tuned. Or the linear regression was biased from the start. Keep repeating the steps until your results become more accurate!

Conclusion

Data science allows us to unlock new insights from large data sets with statistical analysis. While the math behind these techniques can be complex, the core ideas are fairly simple and can be understood without a technical background. These techniques can ultimately help us make better decisions and provide a powerful edge in business.