Written by Vectice

Deep Dive #2: KDD


Knowledge Discovery in Databases

This is the second part of a 3-part series of Deep Dives into Data Science Methodologies.

In our second episode of Deep Dive Methodologies, we are looking at Knowledge Discovery in Databases (KDD). While CRISP-DM encompasses the entire data science lifecycle, KDD focuses on the data mining stage. It comprises 5 stages:

1. Data Selection
2. Data Pre-Processing
3. Data Transformation
4. Data Mining
5. Evaluation

Rashmi Karan writes: “The purpose of KDD is the interpretation of patterns, models, and a deep analysis of the information that an organization has gathered to make better decisions.” In other words, we want to perform statistical analysis on large data sets to gain actionable insights.

Let’s say we’re a data science team working for a health insurance company in North America. We have access to a huge dataset of customer records and we’re asked to segment them.

By dividing customers into groups, we can offer tailored insurance packages. We can create targeted advertising to reach the right audiences. Customer segmentation also allows us to correlate different health categories with associated risks.

Since this task is mainly focused on data mining, we will use the KDD methodology, walking through each stage with our hypothetical scenario:

Data Selection: Since we have access to massive customer records, we can select the data points most relevant for the segmentations we want to create. Possible factors include age, sex, weight, profession, income, medical history, etc.
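
As a rough sketch of what this could look like in Python (the file name and column names here are hypothetical, purely for illustration), data selection might simply mean loading the records and keeping the relevant fields:

```python
import pandas as pd

# Hypothetical file and column names, used only for illustration.
records = pd.read_csv("customer_records.csv")

# Keep only the fields we consider relevant for segmentation.
selected = records[["age", "sex", "weight", "profession", "income", "medical_history"]].copy()
```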

Data Pre-Processing: Before we can segment customers, we must pre-process our records by formatting them in a coherent structure and fixing inaccuracies and incorrect values.
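
Continuing the hypothetical sketch above, pre-processing could standardize formats and drop clearly invalid records; the specific cleaning rules below are assumptions, not part of the original scenario:

```python
# Illustrative cleaning rules on the hypothetical "selected" DataFrame from the previous sketch.
selected["sex"] = selected["sex"].str.strip().str.lower()           # " Male " -> "male"
selected["age"] = pd.to_numeric(selected["age"], errors="coerce")   # non-numeric ages become NaN
selected = selected[selected["age"].between(0, 120)]                # drop implausible ages
selected = selected.dropna(subset=["age", "weight", "income"])      # remove incomplete records
```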

Data Transformation: Since we are dealing with many variables, we might need to perform transformations such as dimensionality reduction. According to Nilesh Barla: “dimensionality reduction is reducing the number of features in a dataset.” This is important because a model with too many features becomes too complex to work with. Nilesh continues: “the higher the number of features, the more difficult it is to model them. This is known as the curse of dimensionality”.
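
One common way to do this is principal component analysis (PCA). A minimal sketch with scikit-learn, assuming the hypothetical numeric features from the earlier steps, might look like this:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Scale the numeric features first so no single feature dominates the components.
numeric = selected[["age", "weight", "income"]]
scaled = StandardScaler().fit_transform(numeric)

# Project the features onto two principal components.
pca = PCA(n_components=2)
reduced = pca.fit_transform(scaled)
print(pca.explained_variance_ratio_)  # how much variance each component retains
```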

Data Mining: Once we have collected, processed, and transformed our data points, we can start with the most important stage: data mining. This is where we extract insights from our data set. One powerful technique that helps with segmentation is known as “clustering”.

“Clustering is an unsupervised data mining (machine learning) technique used for grouping the data elements without advance knowledge of the group definitions,” writes Srinivasan Sundararajan. He adds that “these groupings are useful for exploring data, identifying anomalies in the data, and creating predictions.” This holds whether you’re working in healthcare, banking, e-commerce, gaming, or another domain.
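
As a hedged example, k-means (one common clustering algorithm, not necessarily the one the quoted authors use) could segment the transformed records from the previous sketch like this:

```python
from sklearn.cluster import KMeans

# Group customers into five segments; the number of clusters is an arbitrary starting point.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
selected["segment"] = kmeans.fit_predict(reduced)
```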

Evaluation: Once we have clustered our customer records, we can evaluate whether we are satisfied with the number of groups and the features we used to segment them. Perhaps we use one segmentation to advertise new insurance packages, but very few people respond. In that case, we can create new segmentations and evaluate whether they generate more responses to our advertising campaign.
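
One way to support that evaluation quantitatively is a cluster-quality metric such as the silhouette score; the sketch below (the metric choice is our assumption, not part of the original article) compares a few candidate segment counts:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compare candidate segment counts; higher scores suggest better-separated groups.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(reduced)
    print(k, round(silhouette_score(reduced, labels), 3))
```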


KDD is a powerful methodology for data mining projects and covers every stage in detail. While it’s more oriented towards data engineers, a data science team must also include experts in mathematics, computer science, and business. Our next methodology helps to define team roles.

SOURCES:

To learn more about the KDD methodology, check out our amazing sources and their profiles!

Srinivasan Sundararajan: Patient Segmentation Using Data Mining Techniques

Rashmi Karan: Knowledge Discovery in Databases (KDD) in Data Mining

Nilesh Barla: Dimensionality Reduction for Machine Learning