Customer Segmentation using Data Science

February 6, 2021

TL;DR: A Data Science Tutorial on using K-Means and Decision Trees together.

Customer segmentation (sometimes called market segmentation) is ubiquitous in the private sector. We think about bucketing people into mutually exclusive and collectively exhaustive (MECE) groups. The premise is that instead of having one strategy for delivering a product or experience, providing tailored experiences or strategies for each group will yield much better engagement or acquisition from our customers.

Generally speaking, this makes sense; it's intuitive. Provide people a more curated experience and they will enjoy it more...and the more personalized the better.

Netflix, Spotify, YouTube, Twitter, Instagram, and all of the big tech companies have mastered personalization by using robust, computationally expensive, and sophisticated machine learning pipelines. But the world has been doing this for a long time, just with a much less sophisticated version.

So I thought I'd give a technical demo of what customer segmentation looks like in a basic way using a trick I've used for years.

Here are the things I'd like to cover during this demo:

  1. What options do I have to segment my customers?
  2. How do I actually do the segmentation?
  3. What can I do with my new customer segments?
  4. How do I know that my segments are effective?
  5. How do I know when my segments have changed?

Approaches to Customer Segmentation

The phrase "Customer Segments" tends to mean different things across different industries, organizations, and even across business functions (e.g., marketing, risk, product, etc.).

As an example, a consumer products retailer may define customer segments using demographic information, purchase behavior, or both, whereas a lender may define its segments based on credit score bands. While these are meaningfully different from a business perspective, the same algorithms can be used for both problems.

Analytically speaking, I've seen Customer Segments defined really in two main ways: (1) Business Segments and (2) Algorithmic Segments. Usually executives refer to their segments in the first category and data scientists focus on the second. The first is really important organizationally because 99% of the people working with your customers don't care about how you bucketed them and customers are the most important thing. Always.

...but how do you actually (i.e., in code and data) get to those segments?

1. Logical Business Segments

These segments tend to be defined by heuristics and things that make common sense. They are often built on things that are aligned with the goal of the business.

Here are some examples:

  • The age of the customer (in years)
  • The income of the customer (in dollars or thousands of dollars)
  • The amount of money a customer spent in the last year
  • The likelihood a customer will spend money at a given store (purchase propensity / propensity to buy)
  • The customer's geographic region (e.g., zipcode, state)

In data, some of that customer information would look something like this:

User ID   Age   Customer Income   Purchase Propensity   ...
1         25    $45,000           0.9                   ...
2         30    $80,000           0.4                   ...
...       ...   ...               ...                   ...
n         56    $57,000           0.1                   ...
And so on.

We could apply some logic/rules/code to create segments like:

  • Age Buckets
    1. < 25
    2. 25-35
    3. 35-55
    4. 55+
  • Income Buckets
    1. < $25K
    2. $25K-50K
    3. $50K-100K
    4. $100K-150K
    5. $150K+
  • Propensity Buckets
    1. Low: [0, 0.25)
    2. Medium: [0.25, 0.75)
    3. High: [0.75, 1.0]

And map that logic into our data, which would yield

User ID   Age Bucket   Income Bucket   Propensity Bucket   ...
1         25-35        $25K-50K        High                ...
2         25-35        $50K-100K       Medium              ...
...       ...          ...             ...                 ...
n         55+          $50K-100K       Low                 ...
And so on.

Pretty simple, right? The code for this categorization is simple too (assuming you're using Pandas and Python; though it's also simple in SQL).

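# Here's one example: bucket annual income (in $K) into the income bands above.
# `cdf` is the customer DataFrame.
import numpy as np
import pandas as pd

cdf['Income Bucket'] = pd.cut(
    cdf['Annual Income ($K)'],
    bins=[0, 25, 50, 100, 150, np.inf],
    labels=['<$25K', '$25K-50K', '$50K-100K', '$100K-150K', '$150K+'],
    right=False,  # left-closed bins, so exactly 25 falls in '$25K-50K'
)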
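To make the mapping concrete, here is a minimal sketch that applies all three sets of buckets to the toy rows from the table above. The column names (`age`, `income_k`, `purchase_propensity`) and the ID 3 standing in for user "n" are assumptions made for illustration, and the bucket boundaries are treated as left-closed.

```python
import numpy as np
import pandas as pd

# Toy rows copied from the example table (user "n" is given ID 3 here).
# Column names are hypothetical, chosen only for this sketch.
customers = pd.DataFrame({
    "user_id": [1, 2, 3],
    "age": [25, 30, 56],
    "income_k": [45, 80, 57],  # annual income in $K
    "purchase_propensity": [0.9, 0.4, 0.1],
})

# right=False makes each bin left-closed, e.g. [25, 35), so age 25 lands in "25-35".
customers["age_bucket"] = pd.cut(
    customers["age"],
    bins=[0, 25, 35, 55, np.inf],
    labels=["<25", "25-35", "35-55", "55+"],
    right=False,
)
customers["income_bucket"] = pd.cut(
    customers["income_k"],
    bins=[0, 25, 50, 100, 150, np.inf],
    labels=["<$25K", "$25K-50K", "$50K-100K", "$100K-150K", "$150K+"],
    right=False,
)
# The last edge is np.inf so a propensity of exactly 1.0 still counts as "High".
customers["propensity_bucket"] = pd.cut(
    customers["purchase_propensity"],
    bins=[0, 0.25, 0.75, np.inf],
    labels=["Low", "Medium", "High"],
    right=False,
)

print(customers[["user_id", "age_bucket", "income_bucket", "propensity_bucket"]])
```

Whether a boundary value such as exactly 25 belongs to the lower or upper bucket is a business decision; `right=False` is just one consistent choice.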

This is a really helpful and simple way to understand our customers and it's the way that most businesses do analytics, but we can do more. 😊

2. Algorithmic Segments

Segments defined using simple business logic are great because they are so easy to interpret, but that's not free. By favoring simplicity we have to limit ourselves to (potentially) suboptimal segments. This is typically on purpose and entirely fine but, again, we can do better.

So how do we do better?

Cue statistics, data mining, analytics, machine learning, or whatever it's called this week. More specifically, we can use the classic K-Means Clustering algorithm to learn an optimal set of segments given some set of data.

To skip over many important details (more here), K-Means is an algorithm that buckets your data into k groups by minimizing the squared Euclidean distance between each point and the center of its assigned group. It's a classic approach and tends to work quite well in practice (there are a ton of other neat clustering algorithms), but it comes with two practical challenges: (1) choosing k and (2) explaining what a single cluster actually means to literally anyone else.

Solving (1) is relatively straightforward. You can run K-Means for a range of values of k and choose what appears to be a k that sufficiently minimizes the within-cluster sum-of-squares (i.e., the inertia, $\sum_{i} \min_{j} \lVert x_i - \mu_j \rVert^2$, where $\mu_j$ is the center of cluster j). Here, notice that the majority of the variation of the clusters can be captured by k = 6.

[Figure: The Inertia Function. Inertia as a function of k.]
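The search over k described above can be sketched roughly as follows. This is a minimal illustration on synthetic data from `make_blobs`, not the dataset used in this post, and the range of 1 to 10 clusters is an arbitrary assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the customer features (not the post's data).
X, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=0)

# Fit K-Means for each candidate k and record the inertia
# (within-cluster sum-of-squares) of the fitted model.
k_values = list(range(1, 11))  # assumed search range
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=111)
    model.fit(X)
    inertias.append(model.inertia_)

# Look for the "elbow": the k after which inertia stops dropping quickly.
for k, inertia in zip(k_values, inertias):
    print(f"k={k:2d}  inertia={inertia:,.1f}")
```

Inertia keeps shrinking as k grows, so the goal is not the minimum but the point of diminishing returns, which is easiest to spot by plotting `inertias` against `k_values`.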

Now to (2), which is the harder challenge. If I were to plot my data and look at the clusters, I'd have something that looks like:

[Figure: K-Means clusters. Look at all 3 of those beautiful dimensions!]

How cool, right? This little algorithm learned pretty clear groups that you can see rather obviously in the data. Impressive! And also useless to your boss and colleagues.

More seriously, while you can see these clusters, you can't actually extract a clear description from them, which makes interpreting them really, really hard when you go past 3 dimensions.

So what can you do to make this slightly more meaningful?

Enter decision trees. Another elegant, classic, and amazing algorithm. Decision Trees basically split up your data using simple if-else statements. So, a trick that you can use is to take the predicted clusters, fit a Decision Tree (Classification) to predict the segment, and use the learned Tree's logic as your new business logic.

I find this little trick pretty fun and effective since I can more easily describe how a machine learned a segment and I can also inspect it. Let's suppose I ran my tree on this learned K-means, what would the output look like?

[Figure: Decision Tree Output. Is this really more interpretable?]

There you have it, now you have a segmentation that is closer to optimal and somewhat easier to interpret. It's still not as good as the business definition but you could actually read through this and eventually come up with a heuristic driven approach as well, which is why I like it and why I've used it in the past.

And here's the code to run the K-means and the Decision tree.

import pydotplus
import pandas as pd
from io import StringIO  # sklearn.externals.six was removed from scikit-learn
from IPython.display import display
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

# X1 is the feature matrix, xcol_labels its column names,
# and cdf the customer DataFrame.
optimal_clusters = 6
# 6 clusters 6 colors
xcolors = ['red', 'green', 'blue', 'orange', 'purple', 'gray']
# Chose 6 as the best number of clusters
kmeans_model = KMeans(
    n_clusters=optimal_clusters, init='k-means++', n_init=10, max_iter=300,
    tol=0.0001, random_state=111, algorithm='elkan',
)
kmeans_model.fit(X1)
cdf['pred_cluster_kmeans'] = kmeans_model.labels_
centroids = kmeans_model.cluster_centers_

# Share of customers in each cluster
display(pd.DataFrame(cdf['pred_cluster_kmeans'].value_counts(normalize=True)))

# Train a Decision Tree classifier to predict the K-Means cluster
clf = DecisionTreeClassifier()
clf = clf.fit(X1, cdf['pred_cluster_kmeans'])

# Predict the cluster for the same data the tree was trained on
cdf['pred_class_dtree'] = clf.predict(X1)

# Compare the K-Means clusters against the tree's predictions
display(pd.crosstab(cdf['pred_cluster_kmeans'], cdf['pred_class_dtree']))

# Export the tree to Graphviz and render it as a PNG
dot_data = StringIO()
export_graphviz(
    decision_tree=clf,
    out_file=dot_data,
    filled=True,
    rounded=False,
    impurity=False,
    special_characters=True,
    feature_names=xcol_labels,
    # class_names must follow the ascending order of the classes
    class_names=clf.classes_.astype(str).tolist(),
)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png("./decisiontree.png")
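If you'd rather read the learned rules as text than as an image, scikit-learn's `export_text` prints the tree as nested if-else style conditions. The sketch below is a self-contained variation on the same trick, run on synthetic data; the feature names and cluster centers are hypothetical, and `max_depth=3` is an assumption made to keep the rules short.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical feature names and synthetic, well-separated customer groups.
feature_names = ["age", "income_k", "spend_score"]
centers = [[25, 40, 20], [40, 80, 60], [60, 55, 90]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=2.0, random_state=0)

# Step 1: learn the segments with K-Means.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=111).fit(X)
clusters = kmeans.labels_

# Step 2: fit a shallow tree that predicts the K-Means cluster.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, clusters)

# How often the tree's rules reproduce the K-Means assignment.
agreement = float(np.mean(tree.predict(X) == clusters))

# Step 3: print the tree as readable threshold rules.
rules = export_text(tree, feature_names=feature_names)
print(f"tree/cluster agreement: {agreement:.2%}")
print(rules)
```

Capping the depth trades some agreement with the original clusters for rules that are short enough to hand to a colleague; on messier real data the agreement will typically be lower than on this toy example.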

What can you do with your new segments?

Now that we have our customer segments we can do all sorts of different things. We can create A/B tests for website experiences or we can test the impact of changing our prices to certain customers. In general, we can just try a bunch of new stuff.

How do I know if my segments are accurate?

The metric we used in the example above (i.e., within cluster sum-of-squares / inertia) was a reasonably straightforward way to measure the accuracy of your segments from an analytical perspective, but if you wanted to take a closer look, I'd recommend reviewing individual users in each segment. It sounds a little silly and can, in some cases, lead to the wrong conclusions but I firmly believe that in data science, you just have to really look at your data. You learn a lot from it.
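Reviewing individual users per segment can be as simple as a per-segment summary plus a few random rows from each group. Here is a minimal sketch on synthetic data; the column names `feature_a` and `feature_b` are placeholders, and `groupby(...).sample` requires pandas 1.1 or newer.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for customer features (placeholder column names).
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
customers = pd.DataFrame(X, columns=["feature_a", "feature_b"])
customers["segment"] = KMeans(
    n_clusters=3, n_init=10, random_state=111
).fit_predict(X)

# Per-segment summary statistics for every feature.
summary = customers.groupby("segment").agg(["mean", "min", "max"])

# A handful of random customers from every segment to eyeball.
examples = customers.groupby("segment").sample(n=3, random_state=0)

print(summary)
print(examples.sort_values("segment"))
```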

How do I know when my segments need to change?

Lastly, segments can change; your customers are always evolving so it's good to re-evaluate your clusters time and again. The emergence of new segments should feel very obvious, since it may be driven by product or acquisition changes. As a concrete example, if you noticed that important business metrics split by your segments are starting to behave a little differently, then you can investigate whether it's driven by a change in the segments; sometimes it is, sometimes it's not.
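One simple way to check for this kind of change is to score a newer batch of customers with the already-fitted model and compare the share of customers in each segment against the baseline. The sketch below uses synthetic data for both periods; the helper `segment_shares` and the idea of tracking the largest shift in share are illustrative assumptions, not a standard recipe.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic baseline period (balanced groups) and newer period (skewed mix).
centers = [[-10, -10], [0, 0], [10, 10]]
X_base, _ = make_blobs(n_samples=[100, 100, 100], centers=centers, random_state=0)
X_new, _ = make_blobs(n_samples=[200, 50, 50], centers=centers, random_state=1)

k = 3
model = KMeans(n_clusters=k, n_init=10, random_state=111).fit(X_base)


def segment_shares(labels, k):
    """Hypothetical helper: fraction of customers in each of the k segments."""
    shares = pd.Series(labels).value_counts(normalize=True)
    return shares.reindex(range(k), fill_value=0.0).sort_index()


base_shares = segment_shares(model.labels_, k)
new_shares = segment_shares(model.predict(X_new), k)

# Largest absolute change in any segment's share between the two periods.
max_shift = (new_shares - base_shares).abs().max()
print(pd.DataFrame({"baseline": base_shares, "new": new_shares}))
print(f"largest shift in segment share: {max_shift:.2f}")
```

A large shift only says the mix of customers has moved between existing segments; deciding whether the segments themselves need to be re-learned still takes a look at the data.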

Conclusion

This tutorial ended up being a little longer than I anticipated but oh well, I hope you enjoyed it.

I've stored the code to reproduce this example in a Jupyter Notebook available on my GitHub (note: to render the interactive 3D visualization you have to run the notebook). To get it up and running you only need to download the notebook, download the data, install Docker, and run:

docker run -it -p 8888:8888 -v ~/path/to/your/folder/:/home/jovyan/work --rm --name jupyter jupyter/scipy-notebook:17aba6048f44

And you should be good to go. Happy segmenting!

Have some feedback? Feel free to let me know!