
Deep Dive #1: CRISP-DM


Cross-Industry Standard Process for Data Mining

This is the first article in a three-part series of deep dives into data science methodologies.

The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a widely used process model for structuring a data science project. Its six stages are:

1. Business Understanding

2. Data Understanding

3. Data Preparation

4. Modeling

5. Evaluation

6. Deployment

Let’s say we are a travel agency with thousands of daily customer service queries. We are building a chatbot to resolve simple queries without human input. After each solution provided by the chatbot, we send a quick survey asking the customer to rate their satisfaction on a scale of 1 to 10. If the rating is lower than 8, the customer can also provide written feedback. The six stages would be defined as follows:

Business Understanding: we want to build and improve a chatbot that automatically resolves customer service queries.

Note: defining the business problem first is essential to having a successful analytics project. As Zipporah Luna writes, “not knowing how to frame the business problem is a problem itself”. She adds that some data science teams will start a project by focusing on the data set and what model to use, without addressing the original business need.

Data Understanding: we’ll collect data from survey responses. All data will be organized by category of issue that was resolved, along with a score of how satisfied the customer was with the provided solution.

Data Preparation: each data record contains up to three values: the category of issue, a numerical rating of the customer’s satisfaction, and possibly a text comment. The first two are easy to store and access, and require minimal preparation and cleaning. The written feedback will be used to assess outliers and improve the chatbot.
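As a minimal sketch of what that preparation could look like, the record described above can be modeled as a small Python data class with a cleaning function. The names `SurveyResponse`, `clean_response` and `FEEDBACK_THRESHOLD` are hypothetical and chosen for illustration; only the 1 to 10 scale and the "lower than 8" rule come from the example above.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption taken from the example: written feedback is only offered
# when the rating is lower than 8.
FEEDBACK_THRESHOLD = 8


@dataclass(frozen=True)
class SurveyResponse:
    """One survey record: issue category, 1-10 rating, optional comment."""
    category: str
    rating: int
    comment: Optional[str] = None


def clean_response(category: str, rating: int,
                   comment: Optional[str] = None) -> SurveyResponse:
    """Validate and normalize one raw survey record."""
    category = category.strip().lower()
    if not category:
        raise ValueError("category must not be empty")
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    # Keep a comment only if it is non-blank and the rating allowed one.
    text = comment.strip() if comment else ""
    if rating >= FEEDBACK_THRESHOLD or not text:
        return SurveyResponse(category, rating, None)
    return SurveyResponse(category, rating, text)
```

Normalizing the category at this stage keeps later grouping simple, since "Refund" and " refund " end up in the same bucket.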

Modeling: the model we’re building is the chatbot itself, an automated service that can direct customers to the right support articles or execute simple tasks like issuing refunds or giving an update on a case status. This model will have to be continuously updated based on the feedback we’re collecting from customers. This is what makes our project iterative.

Evaluation: we can plot the satisfaction ratings over time and evaluate whether the quality of our chatbot improves or declines. We can pinpoint which issues the chatbot solves poorly by isolating negative ratings, reading the written feedback and making improvements.
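The two evaluation steps above could be sketched roughly as follows, using only the Python standard library. The function names, the weekly grouping and the tuple layout `(day, category, rating)` are assumptions made for illustration; the sketch computes the numbers one would plot rather than drawing the plot itself, and treats a rating lower than 8 as negative, in line with the survey rule in the example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

NEGATIVE_THRESHOLD = 8  # assumption: a rating lower than 8 counts as negative


def mean_rating_by_week(responses):
    """Average rating per ISO (year, week), ordered by time.

    `responses` is an iterable of (day, category, rating) tuples.
    """
    buckets = defaultdict(list)
    for day, _category, rating in responses:
        iso = day.isocalendar()
        buckets[(iso[0], iso[1])].append(rating)
    return {week: mean(ratings) for week, ratings in sorted(buckets.items())}


def negative_share_by_category(responses):
    """Share of negative ratings per category, worst category first."""
    counts = defaultdict(lambda: [0, 0])  # category -> [negative, total]
    for _day, category, rating in responses:
        counts[category][1] += 1
        if rating < NEGATIVE_THRESHOLD:
            counts[category][0] += 1
    shares = {c: neg / total for c, (neg, total) in counts.items()}
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))


# Made-up sample data, for illustration only.
sample = [
    (date(2024, 1, 1), "refund", 4),
    (date(2024, 1, 2), "refund", 9),
    (date(2024, 1, 8), "case status", 10),
    (date(2024, 1, 9), "refund", 6),
]
weekly = mean_rating_by_week(sample)
worst_first = negative_share_by_category(sample)
```

The weekly averages can be passed to any plotting tool, and the categories at the top of `worst_first` are the ones whose written feedback is worth reading first.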

Deployment: the chatbot should be deployed once we have a basic system that works without errors, and should then be continuously improved to keep serving the original business need. This is what iteration means. Let’s look at iteration in more detail.

In the CRISP-DM diagram, the arrow from Evaluation back to Business Understanding is what enables iteration.

Rajiv Shah writes that “the iterative approach in data science starts with emphasizing the importance of getting to a first model quickly, rather than starting with the variables and features. Once the first model is built, the work then steadily focuses on continual improvement.”

Iteration is a common approach in computer science methodologies and has multiple benefits:

- Iteration allows for continuous improvement of a product or service.

- Iteration helps a product or service stay relevant in a dynamic environment.

- Iteration lets developers fine-tune models before deploying them on a massive scale.

Machine learning is a constantly iterative process, and understanding this concept will help you understand many fundamental techniques of ML and AI. Niwratti Kasture emphasizes that “you are never guaranteed to have a perfect model until it has gone through a significant amount of iterations to learn the various scenarios from the data.” He qualifies this by adding that a “perfect model” doesn’t exist and that, instead, models should go through the machine learning cycle until a desired confidence level is achieved.
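The idea of cycling until a desired level is reached can be written as a very small loop. This is only a sketch under stated assumptions: `evaluate` and `improve` are hypothetical callables supplied by the team (for our chatbot, `evaluate` might return the average satisfaction rating), and the `max_rounds` cap is an added safeguard, not something the sources prescribe.

```python
def iterate_until_target(model, evaluate, improve, target, max_rounds=10):
    """Improve `model` until its score reaches `target` or rounds run out.

    Returns the final model and the list of scores, one per evaluation.
    """
    history = [evaluate(model)]
    # Stop when the target is met or after max_rounds improvement rounds.
    while history[-1] < target and len(history) <= max_rounds:
        model = improve(model)
        history.append(evaluate(model))
    return model, history


# Toy usage: the "model" is just a number that improves by 1 each round.
final_model, scores = iterate_until_target(
    model=0,
    evaluate=lambda m: m,
    improve=lambda m: m + 1,
    target=3,
)
```

Keeping the score history makes it possible to see whether each round actually helped, which is the same question the Evaluation stage asks.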

In our next article, we take a deep dive into the KDD methodology. Check it out here!


To learn more about CRISP-DM and the iterative process in data science, check out our amazing sources and their profiles!

Zipporah Luna: CRISP-DM Phase 1: Business Understanding

Rajiv Shah: Measure Once, Cut Twice: Moving Towards Iteration in Data Science

Niwratti Kasture: Machine Learning — Why it is an iterative process?