by Ryan Harrington
Over the past two weeks we discussed two techniques that help to find patterns within data: clustering and association rules analysis. Clustering helps to split data into groups that are similar to each other. Association rules help to find items that are commonly grouped together. On their own, these techniques are powerful and could help any business to make better strategic decisions. While these techniques help you to mine your data (to understand the patterns within it), they do not make any predictions about what will happen in the future. That’s where the last set of techniques comes into play, aptly named predictive analysis.
That means it can tell me all about my future – right?
Not quite, though this set of techniques can do really well with certain types of problems. Two broad techniques fall under predictive modeling: classification and regression. The key difference between them is what they predict. Classification predicts group membership (which category an observation belongs to), whereas regression estimates a numeric value.
To illustrate these differences, let’s consider the example of a telecommunications company. The company has plenty of data about its customers. This includes everything from information about the products that their customers have purchased to demographic information (such as age, gender, and location). There are two questions that the company would like answered based upon this information:
- Which customers will end their contract (churn)?
- How much revenue can they expect from a new customer?
The first of these questions, “Which customers will churn?” is an example of a classification problem. Every customer in the dataset can be labeled as either someone who continued with their contract or ended it. The outcome is binary. One of the classes can be labeled as “0” and the other can be labeled as “1”.
On the other hand, the second question, “How much revenue can they expect from a new customer?”, has a much broader range of outcomes. Some customers might make the company quite a bit of money each month, whereas others will make very little. The outcome is continuous: it can take any value within a range rather than one of a fixed set of labels.
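To make the contrast concrete, here is a minimal sketch that answers both questions from the same inputs. It uses a simple nearest-neighbor approach, which is only one of many possible techniques and is chosen here for brevity. The customer records, field order, and function names are all made up for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical past customers: (monthly_minutes, age, churned, monthly_revenue).
# All values are invented for illustration.
CUSTOMERS = [
    (120, 25, 1, 18.0),
    (150, 31, 1, 22.5),
    (480, 45, 0, 61.0),
    (510, 52, 0, 66.5),
    (300, 38, 0, 40.0),
    (90, 22, 1, 15.0),
]


def nearest(new_customer, k=3):
    """Return the k past customers most similar to new_customer."""
    def distance(row):
        # Squared distance on the two input fields (left unscaled for brevity).
        return sum((a - b) ** 2 for a, b in zip(row[:2], new_customer))
    return sorted(CUSTOMERS, key=distance)[:k]


def predict_churn(new_customer, k=3):
    """Classification: the answer is a group label (0 or 1)."""
    votes = Counter(row[2] for row in nearest(new_customer, k))
    return votes.most_common(1)[0][0]


def predict_revenue(new_customer, k=3):
    """Regression: the answer is a number on a continuous scale."""
    return mean(row[3] for row in nearest(new_customer, k))


new_customer = (130, 27)  # (monthly_minutes, age)
churn_label = predict_churn(new_customer)
revenue_estimate = predict_revenue(new_customer)
print(churn_label, revenue_estimate)
```

The inputs are identical in both cases. Only the kind of answer changes: a label for churn, a number for revenue.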
What do I need in order to perform predictive modeling?
For both classification and regression, most of the preparation is the same. In fact, it closely mirrors the steps that were needed for clustering.
- Select the appropriate data
Often there will be a large amount of data to select from. Depending upon the type of analysis that is being performed, some data might be more useful than others. Just like with clustering, garbage data in will lead to garbage data out.
- Normalize the data
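Data comes in many different forms. Some fields are categorical (such as a tariff type) and some are continuous (such as minutes used). Each kind needs to be handled differently: categorical fields typically need to be encoded as numbers, while continuous fields often need to be rescaled so that they are comparable.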
- Select a target field
Unlike with clustering and association modeling, we need to find a field that we are actively trying to predict. For classification models, we would be looking for a field that is binary. In our telecommunications example, we should have a field called “Churn” where every previous customer has a classification. Importantly, both classes (churns vs. does not churn) need to be present for a prediction to be made. For regression models, we should have a field with a continuous value. Going back to our telecommunications example again, we should have a field called “Revenue” that includes a value for every previous customer.
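The three preparation steps above could be sketched roughly as follows. This is a simplified illustration, not a full pipeline: the records and field names are hypothetical, and min-max scaling and one-hot encoding are just one common choice for each kind of field.

```python
# Hypothetical raw customer records; field names and values are assumptions.
RAW = [
    {"id": 1, "age": 25, "tariff": "basic", "intl_minutes": 10, "churn": "yes"},
    {"id": 2, "age": 45, "tariff": "premium", "intl_minutes": 90, "churn": "no"},
    {"id": 3, "age": 35, "tariff": "basic", "intl_minutes": 50, "churn": "no"},
]

# Step 1: select the appropriate data ("id" is deliberately left out).
CONTINUOUS = ["age", "intl_minutes"]
CATEGORICAL = ["tariff"]
# Step 3: select the target field.
TARGET = "churn"


def min_max(values):
    """Rescale numbers to the 0-1 range (one common normalization)."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def one_hot(values):
    """Turn a categorical field into one 0/1 column per category."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


def prepare(records):
    """Step 2: normalize each field according to its type."""
    columns, names = [], []
    for field in CONTINUOUS:
        columns.append(min_max([r[field] for r in records]))
        names.append(field)
    for field in CATEGORICAL:
        categories, encoded = one_hot([r[field] for r in records])
        for i, category in enumerate(categories):
            columns.append([row[i] for row in encoded])
            names.append(f"{field}={category}")
    features = [list(row) for row in zip(*columns)]
    target = [int(r[TARGET] == "yes") for r in records]
    # A classifier cannot learn from a target that only contains one class.
    if len(set(target)) < 2:
        raise ValueError("both classes must be present in the target field")
    return names, features, target


names, features, target = prepare(RAW)
print(names)
print(features)
print(target)
```

For a regression question, the only change would be the target: a numeric field such as “Revenue” would be kept as a number instead of being mapped to 0 and 1.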
What information do these models provide?
Each predictive model provides information that can be extremely powerful.
For regression, the output of each model is an expected value. Referring to our telecommunications example, we can predict revenue for each customer. Our output will be a value which represents the same information as our target field. Because we selected a target field that represented revenue, the output will also represent revenue. Technically, any value is a possible outcome.
For classification, the output of each model indicates which of the two classes an observation falls into. The different classification techniques provide this information in the form of a propensity score. Each observation is given an output between 0 and 1, which represents the estimated probability that the observation falls into the class that we labeled as “1”. For example, in our churn question, churning could be labeled as “1” and not churning as “0”. A new customer that is run through the classification model might have a final output, or propensity score, of “0.572”. This means the model estimates a 57.2% likelihood that the customer will churn in the future.
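As a rough illustration of where a propensity score can come from, the sketch below uses a logistic function, which is how logistic regression (one of several classification techniques) turns a weighted sum of inputs into a number between 0 and 1. The coefficients, field names, and the 0.5 cutoff are assumptions made up for this example, not values from a real model.

```python
import math

# Hypothetical coefficients, as if they had been learned from past customers.
INTERCEPT = -1.0
WEIGHTS = {"intl_minutes": 0.02, "support_calls": 0.4}


def propensity(customer):
    """Return a score between 0 and 1 for the class labeled 1 (churn)."""
    score = INTERCEPT + sum(WEIGHTS[name] * customer[name] for name in WEIGHTS)
    # The logistic function squeezes any real number into the 0-1 range.
    return 1.0 / (1.0 + math.exp(-score))


def classify(customer, threshold=0.5):
    """Turn the propensity score into a 0/1 label using a cutoff."""
    return 1 if propensity(customer) >= threshold else 0


customer = {"intl_minutes": 25, "support_calls": 2}
p = propensity(customer)
label = classify(customer)
print(f"propensity={p:.3f}, predicted class={label}")
```

The cutoff is a business decision as much as a technical one. A company that wants to catch more potential churners might lower it, at the cost of more false alarms.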
Besides the obvious outcomes, each of these models also provides more useful information. In creating the outputs, the models also determine which of the fields contribute the most to the outcomes. For our churn model, we might see an example that looks like the following:
This bar plot shows different features of the dataset along the Y-axis and their relative importance as a percentage along the X-axis. We can interpret this to mean that whether a customer has purchased Handset_ASAD90 is a relatively strong predictor of churn. A variety of customer characteristics are present, from the number of international minutes used to the type of tariff that the customer pays.
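There are several ways to measure how much each field contributes, and the plot described above does not say which one was used. One widely used approach is permutation importance: shuffle one field across customers and see how much the model's accuracy drops. The sketch below assumes a stand-in model and invented data, and prints a small text version of such a bar plot.

```python
import random

# Hypothetical labeled customers: (intl_minutes, age, churned).
ROWS = [
    (150, 30, 1), (20, 41, 0), (180, 25, 1), (35, 52, 0),
    (210, 47, 1), (10, 33, 0), (160, 58, 1), (40, 29, 0),
]


def model(intl_minutes, age):
    """Stand-in for a trained classifier; it only looks at intl_minutes."""
    return 1 if intl_minutes > 100 else 0


def accuracy(rows):
    """Share of customers whose predicted class matches the known class."""
    return sum(model(m, a) == y for m, a, y in rows) / len(rows)


def permutation_importance(rows, column, seed=0):
    """Accuracy lost when one input column is shuffled across customers."""
    shuffled = [row[column] for row in rows]
    random.Random(seed).shuffle(shuffled)
    permuted = []
    for row, value in zip(rows, shuffled):
        new_row = list(row)
        new_row[column] = value
        permuted.append(tuple(new_row))
    return accuracy(rows) - accuracy(permuted)


raw = {
    "intl_minutes": permutation_importance(ROWS, 0),
    "age": permutation_importance(ROWS, 1),
}
# Express each importance as a percentage of the total, as in the plot.
total = sum(raw.values())
relative = {
    name: (100 * value / total if total else 0.0) for name, value in raw.items()
}
for name, pct in sorted(relative.items(), key=lambda item: -item[1]):
    print(f"{name:<14}{'#' * round(pct / 5)} {pct:.0f}%")
```

Because the stand-in model ignores age, shuffling age never changes its predictions, so age receives zero importance. Note that importance says how much a field matters to the prediction, not in which direction it pushes the outcome.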
So, how can my company use this information to make decisions?
Predictive modeling is useful across a broad range of industries, since many of them benefit from being able to classify observations or predict values. Here are some examples:
- Prioritizing sales calls
Many companies use call centers for a large variety of reasons. One important function is to call potential sales leads. Classification models can be built to determine which sales leads are most likely to convert into customers. The propensity scores that come from the modeling process can then be ordered from most likely to become a customer to least likely, so that callers start with the most promising leads. This can save the company valuable resources.
- Insurance risk
Insurance companies capture a broad range of information about their customers. This information can be used to generate risk scores for customers, which can then help companies price their products appropriately.
- Determining appropriate salaries
It can be extremely difficult for a person to determine what an appropriate salary for them might be. This can deeply complicate the job search process. Predictive models can be employed to provide an appropriate range of output salaries based upon a person’s background and skillsets. This is also useful for companies, especially as they look to begin building talent in specific areas, such as data science.
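To show how little is needed to act on a model's output, here is a minimal sketch of the sales-call example from the list above. It assumes the propensity scores have already been produced by a classification model. The leads, scores, and the `capacity` parameter are hypothetical.

```python
# Hypothetical sales leads with propensity scores from a classification model.
LEADS = [
    {"lead": "A", "propensity": 0.31},
    {"lead": "B", "propensity": 0.87},
    {"lead": "C", "propensity": 0.572},
    {"lead": "D", "propensity": 0.12},
]


def call_list(leads, capacity):
    """Return the leads to call first, most likely to convert at the top."""
    ranked = sorted(leads, key=lambda lead: lead["propensity"], reverse=True)
    # Only keep as many leads as the call center has time to contact.
    return ranked[:capacity]


to_call = call_list(LEADS, capacity=2)
print([lead["lead"] for lead in to_call])
```

The same pattern applies to the churn example: rank current customers by their churn propensity and direct retention offers to those at the top of the list.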
Predictive modeling can provide a broad range of useful information for organizations interested in employing these techniques. When combined with association modeling and clustering, it can draw truly valuable insights from data, limited only by the quality of the data available and the creativity of the analyst.