*by Ryan Harrington*

Last week we discussed three broad buckets of predictive analytics: clustering, prediction, and association. The techniques in each of these buckets allow organizations to gain deep insights into the work that they do, and each bucket is a piece of the puzzle in building a model for a company. This week, we’ll discuss the first of those buckets: clustering.

**So, what is clustering?**

At its core, the idea behind clustering is straightforward – **to split up data into different groups.** The goal is for the observations within each group to be very similar to one another, but very different from the observations in the other groups.

To illustrate this, let’s take a look at data about something near and dear to all of our hearts: commute times. We’re sourcing the data from the 2015 American Community Survey. Specifically, we’ll look at the “Commuting Characteristics by Sex” table.

**How does clustering work?**

There are a few ways to think about clustering, but we’ll concentrate on two of the most common approaches: **hierarchical clustering** and **centroid-based clustering**. Regardless of which type of clustering is being performed, the same few steps come first:

**Select the appropriate data**

Before getting started, make sure that your data is appropriate for the task. If we’re comparing commute times, then the data we use should actually describe commuting. Clustering, just like any type of analytics, is susceptible to GIGO (garbage in, garbage out).

**Normalize the data**

In the case of our commuting data, everything is expressed in percentages, so it is already normalized. However, that is not always the case. Sometimes data can be expressed on very different scales. For example, if we included the total population of each state, that would be on a vastly different scale than percentages. To make sure that one feature isn’t overwhelming our analysis, everything should be normalized, or “scaled”, appropriately.
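To make the scaling step concrete, here is a minimal sketch of one common approach, z-score standardization, which rescales each feature to a mean of 0 and a standard deviation of 1. The function name `standardize` and the numbers below are made up for illustration; they are not figures from the survey, and other scaling methods (such as min-max scaling) may suit your data better.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a list of numbers to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column has no spread, so map every value to 0.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Hypothetical values for three states (illustration only, not ACS data).
population = [1_000_000, 5_000_000, 20_000_000]
pct_drive_alone = [70.0, 75.0, 80.0]

scaled_population = standardize(population)
scaled_pct = standardize(pct_drive_alone)

# After scaling, both features live on the same scale.
print(scaled_population)
print(scaled_pct)  # roughly [-1.22, 0.0, 1.22]
```

Once both columns are standardized, a difference of millions of people no longer drowns out a difference of a few percentage points.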

**Select an appropriate distance metric**

Clustering analysis is all about distances. That’s how we’re able to determine how closely related any two observations are. If you think back to high school math, you’ve actually already learned about some of the more important distance measurements. One of the most popular metrics is called “Euclidean distance”. If you can recall the Pythagorean theorem, that’s essentially what’s happening, just extended to more than two dimensions. There are other distance metrics that can be selected as well. Which one you choose depends upon how you want the data to be clustered.
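As a quick illustration, here is a small sketch of Euclidean distance next to one alternative, Manhattan distance (the sum of absolute differences). The two observations are invented for the example, and the function names are our own.

```python
import math

def euclidean(a, b):
    """Straight-line distance: the Pythagorean theorem in any number of dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: the sum of absolute differences along each feature."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Two hypothetical observations with two features each.
obs_a = [3.0, 4.0]
obs_b = [0.0, 0.0]

print(euclidean(obs_a, obs_b))  # 5.0 (the classic 3-4-5 right triangle)
print(manhattan(obs_a, obs_b))  # 7.0
```

The same pair of observations is 5 apart under one metric and 7 apart under the other, which is why the choice of metric can change which observations end up grouped together.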

**What are some common clustering techniques?**

*Hierarchical Clustering*

At this point, you’ll need to select which clustering method you’d like to move forward with. Let’s look at hierarchical clustering first. This method of clustering creates a hierarchy within the data. It uses the selected distance metric to find the distances between each pair of observations, and then organizes the observations based upon how close together they are. Our final product is called a “dendrogram”. For our commute data, the dendrogram looks like this:

You’re probably thinking something like “wow, that’s a bit confusing”. Well, don’t worry. You’re not wrong in thinking that. Even with just 50 states (and Washington DC), the data becomes a bit overwhelming to look at. Hierarchical clustering is often used only for relatively small datasets. Can you imagine if instead of looking at the 50 states in the US we chose to look at the 3,007 counties? This method wouldn’t be particularly helpful for that. Instead, it is much more common to use centroid-based clustering.
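For readers who want to see the hierarchical idea in code, here is a minimal sketch of bottom-up (agglomerative) clustering on four made-up points. It assumes Euclidean distance and the "single linkage" rule, where the distance between two groups is the distance between their closest members. That rule is only one of several options and is our assumption, not necessarily the one behind the dendrogram above.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points):
    """Repeatedly merge the two closest clusters until one cluster remains.

    Returns the merges in order as (cluster_a, cluster_b, distance), where
    each cluster is a tuple of point indices. A dendrogram is a drawing of
    exactly this merge history.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(euclidean(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Four hypothetical observations (illustration only).
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 7.0]]
merges = single_linkage(points)
for a, b, d in merges:
    print(a, b, round(d, 2))
```

Every pass compares every cluster against every other cluster, which hints at why this approach becomes slow and hard to read as the number of observations grows.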

*Centroid-Based Clustering*

In hierarchical clustering, we compared the distances of every observation to *every other* observation. We don’t do this for centroid-based clustering. Instead, we compare each observation to a set number of…you guessed it…*centroids*. You probably have a hunch as to what a centroid is. Simply put, it is a point that represents the center of the data. For centroid-based clustering, there are two questions that you need to ask yourself:

**How many centroids do I want to use?**

You can select *as many or as few centroids as you’d like*, though typically people choose a relatively small number. For our commuting data, if we choose one centroid, then every state will be considered part of that group. On the other hand, if we choose 51 centroids, then every state will be part of its own group. Obviously, neither of these scenarios is particularly useful. Instead, we want to select a small number of groups – perhaps somewhere between 3 and 5. This would split our data up into a manageable number of groups (or segments for the marketers reading this). There are also techniques for estimating the best number of clusters for a given dataset. If you're curious to learn more about that, take a look at this great post (fair warning: it's a bit technical).

**How do I determine where the centroids should be?**

The beauty of centroid-based clustering techniques is that they are driven by algorithms. Centroids are initially chosen *randomly*. From there, the algorithm assigns each observation to its nearest centroid, creating groups. The center of each group is then calculated and becomes the new centroid, and the assignment process occurs again. This repeats until the centroids no longer change between iterations. At this point, the final centroids and groups are selected. You can find a deeper explanation here (fair warning again: also technical).
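The loop described above is commonly known as k-means. The post does not name the exact algorithm used for the commuting data, so treat the following as a minimal sketch under that assumption. The function names, the `seed` and `max_iter` parameters, and the six sample points are all ours, for illustration only.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return [sum(coord) / n for coord in zip(*points)]

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Start from k randomly chosen observations as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest centroid.
        groups = [[] for _ in range(k)]
        labels = []
        for p in points:
            nearest = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            groups[nearest].append(p)
            labels.append(nearest)
        # Update step: move each centroid to the center of its group.
        # An empty group keeps its previous centroid.
        new_centroids = [mean_point(g) if g else centroids[c]
                         for c, g in enumerate(groups)]
        if new_centroids == centroids:
            break  # The centroids stopped moving, so we are done.
        centroids = new_centroids
    return centroids, labels

# Two hypothetical, well-separated blobs of three points each.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Because the starting centroids are random, which group gets called 0 and which gets called 1 can vary with the seed. On messier data the final groups themselves can depend on the starting points, which is why the algorithm is often run several times.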

For anyone who's more of a visual learner, here’s a great example of what this process looks like. Below is a plot of 5000 randomly generated data points. We want to use *5 centroids* in order to create *5 groups* out of the data.

Initially the centroids are randomly assigned. Data points are grouped with the centroids based upon which centroid they are closest to. A new centroid is then determined for each group, and the whole process occurs again. The process stops once the centroids stop moving, which happens around the 30th iteration.

It’s quite easy to see natural groupings for 2- or 3-dimensional data. However, most datasets are *much* more complex than this. Our commuting data, for example, is being compared on **56** different features. It would be extremely difficult for a person to create groups based upon data in this many dimensions, but centroid-based clustering is able to do it quickly.

Here’s what the different clusters would look like for commuting characteristics for between 2 and 5 clusters. You can see that we do not gain a significant amount of information from 2 clusters, but as we increase the number of clusters we get progressively more information. If you're having difficulty viewing the embedded version of the maps, try viewing them directly here.

**Great! Now how can my company actually use clustering?**

While your company might not be particularly interested in commuting characteristics for each of the different states, there are a huge number of ways that clustering can be used. It is an extremely versatile predictive analytics technique with applications in every industry.

We previously mentioned that you could think of the groups formed by clustering centroids as “segments”. In a business environment, that is one of the most common examples of cluster analysis – segmenting customers. Once customers have been segmented, they can be better targeted, whether for customer retention or acquiring future customers. Outside of the business world, clustering has been used to identify everything from areas of the world with similar climates, to websites with similar audiences, to genes with similar functionality. It is a powerful predictive tool that allows organizations to make better decisions using their data.

Next week, we’ll take a look at another tool in the predictive analytics toolset: **association analysis**. While clustering can help businesses to better identify customer segments, association analysis helps companies understand what products are likely to be sold together.

Want to know more about how to use predictive analytics for your business? Let us know. We’ll help you to put these tools to good use.