(by Patrick Callahan and Dr. Steve Poulin)

In our CompassRed Data Lab, we are always looking, on behalf of our clients, for better ways to “predict” with our algorithms, including Machine Learning (ML) and Artificial Intelligence (AI). As data becomes more ubiquitous and complicated, and as the systems that manage it become more fluid, the process and methodology become ever more important (as long as they’re flexible).

Start with Data, End with Deployment

On either side of the actual predictive process sit, of course, (1) data identification and ingestion (what we like to call “Data Wrangling”) and (2) results deployment. Data Wrangling is probably the easiest step technically, but the most complicated step in reality. By some accounts it can be as much as 80% of the work when building and implementing a predictive analytics process. Deployment is the process of making the results usable, which could be done through visuals (e.g. dashboards), triggers, APIs, or some other mechanism. More on this in future posts.

The Model Development

When it comes to doing the actual predictions (once we have the data in our hands), we follow a five-step process that we continually hone to produce the predictors and the results we need:

Step 1) Determine the Unit of Analysis

Usually, the “unit of analysis” will be structured as a record in the database and is typically a “transaction” or “client”. Transaction records are often ideal, because they can be aggregated to client level records. Examples of transactions are “sales”, “check-ins”, “meeting occurrences”, “employment hires”, etc. A client can be a “customer”, “sales prospect”, or “person(s)”. When we’re working on a time series analysis, the unit of analysis is a period of time (e.g. day, week, month, or year). 
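As a rough illustration of rolling transaction records up to client-level records, here is a minimal Python sketch. The field names (client_id, amount) and the sample rows are made up for the example and do not come from any real client data.

```python
from collections import defaultdict

# Hypothetical transaction records: one row per sale.
transactions = [
    {"client_id": "C1", "amount": 120.0},
    {"client_id": "C2", "amount": 35.5},
    {"client_id": "C1", "amount": 60.0},
    {"client_id": "C3", "amount": 10.0},
    {"client_id": "C2", "amount": 14.5},
]

def aggregate_to_clients(rows):
    """Roll transaction-level rows up to one summary record per client."""
    totals = defaultdict(lambda: {"n_transactions": 0, "total_amount": 0.0})
    for row in rows:
        summary = totals[row["client_id"]]
        summary["n_transactions"] += 1
        summary["total_amount"] += row["amount"]
    return dict(totals)

clients = aggregate_to_clients(transactions)
```

The same transaction table can feed either unit of analysis: keep the rows as they are for a transaction-level model, or aggregate them when the client is the unit.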

Step 2) Identify the Target Field(s)

The Target Fields are usually outcomes of interest to the organization. Each must take at least two values (e.g. sale or no sale; or voluntary termination, involuntary termination, or no termination). This means that you cannot use data in which all of the subjects had the same outcome (e.g. everyone responded to the sales offer).
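The "at least two values" rule is easy to check before any modeling starts. The sketch below is one way to do it in Python; the function name check_target and the sample outcomes are hypothetical.

```python
from collections import Counter

def check_target(values):
    """Count each outcome, and refuse a target that has only one distinct value."""
    counts = Counter(values)
    if len(counts) < 2:
        raise ValueError("Target field needs at least two distinct outcomes")
    return counts

# Hypothetical outcomes for five sales offers.
outcomes = ["sale", "no sale", "no sale", "sale", "no sale"]
counts = check_target(outcomes)
```

The counts are worth a look as well: a target where one outcome is very rare technically passes this check but usually needs extra care.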

Step 3) Identify Initial Predictors

When identifying the initial predictors, we identify all of the fields that could potentially affect the target fields. Often the challenge is that these fields are stored in different tables and in different parts of the organization. Once these predictor fields have been identified, the next challenge is often to access and join them into a single table so that each record includes the target and predictor fields (hence the term “Data Wrangling”). In general, not all potential predictors need to be accessed immediately; more can easily be added later.
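To make the "join into a single table" idea concrete, here is a small Python sketch that attaches predictor fields from two lookup tables to the rows holding the target. All table and field names are invented for illustration; in practice this step is usually done in SQL or a data-frame library.

```python
# Hypothetical tables owned by different parts of the organization.
targets = [
    {"client_id": "C1", "responded": 1},
    {"client_id": "C2", "responded": 0},
    {"client_id": "C3", "responded": 1},
]
demographics = {"C1": {"region": "east"}, "C2": {"region": "west"}}
usage = {"C1": {"visits": 12}, "C3": {"visits": 3}}

def join_tables(base_rows, key, *lookups):
    """Left-join each lookup table onto the base rows by a shared key.

    Clients missing from a lookup simply do not get that table's fields.
    """
    joined = []
    for row in base_rows:
        record = dict(row)  # copy so the source table is not modified
        for lookup in lookups:
            record.update(lookup.get(row[key], {}))
        joined.append(record)
    return joined

table = join_tables(targets, "client_id", demographics, usage)
```

Because each lookup is passed as a separate argument, adding another predictor table later only means adding one more argument to the call.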

Step 4) Build The Models

This is where the fun begins. The “models” are the algorithms that capture historical patterns in the data. Most models are used to (a) rank the predictors for their impact on a target field, or (b) make predictions for new records (known as “scoring”). Today, the most widely used software programs for generating these algorithms are IBM SPSS, R and SAS, because they include many different modeling procedures. Because of cost, R is usually the most used, since it is open source. IBM’s SPSS is the most practiced, and SAS is generally known as the most robust (and expensive!).
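The two uses of a model, ranking predictors and scoring new records, can be shown with a deliberately tiny stand-in. The Python sketch below is not one of the modeling procedures in SPSS, R or SAS; it is a toy "historical rate" model, and the field names and data are hypothetical.

```python
from collections import defaultdict

# Hypothetical history: the target field "sale" is 1 or 0.
history = [
    {"region": "east", "plan": "basic", "sale": 1},
    {"region": "east", "plan": "premium", "sale": 1},
    {"region": "east", "plan": "basic", "sale": 1},
    {"region": "west", "plan": "premium", "sale": 0},
    {"region": "west", "plan": "basic", "sale": 0},
    {"region": "west", "plan": "premium", "sale": 1},
]

def rates_by_value(rows, predictor, target):
    """Historical mean of the target for each value of one predictor."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[predictor]] += row[target]
        counts[row[predictor]] += 1
    return {value: sums[value] / counts[value] for value in counts}

def rank_predictors(rows, predictors, target):
    """(a) Rank predictors by the spread in target rate across their values."""
    spreads = {}
    for predictor in predictors:
        rates = rates_by_value(rows, predictor, target).values()
        spreads[predictor] = max(rates) - min(rates)
    return sorted(spreads, key=spreads.get, reverse=True)

def score(record, rows, predictor, target):
    """(b) Score a new record with the historical rate for its predictor value."""
    rates = rates_by_value(rows, predictor, target)
    overall = sum(row[target] for row in rows) / len(rows)
    return rates.get(record[predictor], overall)  # unseen value: overall rate

ranking = rank_predictors(history, ["plan", "region"], "sale")
new_score = score({"region": "west", "plan": "basic"}, history, ranking[0], "sale")
```

In this made-up history, region separates sales from non-sales while plan does not, so region ranks first and the new west-region record is scored with the west region's historical sale rate.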

Step 5) Test and Modify Models

Finally, the developed models are tested for their predictive accuracy on historical data. In this step, random samples of the historical data can be treated as new cases to test a model’s ability to accurately predict their outcomes. All models have settings that can be adjusted to improve a model’s performance.
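Here is a minimal Python sketch of that testing loop: hold out a random sample of history as "new" cases, tune one setting on the rest, then measure accuracy on the held-out rows. The threshold model, the field names and the data are all assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Randomly hold out a fraction of the historical rows as 'new' cases."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, rows, target):
    """Share of rows whose predicted outcome matches the recorded one."""
    hits = sum(1 for row in rows if predict(row) == row[target])
    return hits / len(rows)

def make_model(threshold):
    """Toy model: predict a sale when visits reach the threshold (the 'setting')."""
    return lambda row: int(row["visits"] >= threshold)

# Hypothetical history in which clients with 5 or more visits bought.
history = [{"visits": v, "sale": int(v >= 5)} for v in range(1, 11)]
train, test = train_test_split(history)

# Adjust the setting using the training rows only.
best_threshold = max(
    range(1, 11), key=lambda t: accuracy(make_model(t), train, "sale")
)
train_accuracy = accuracy(make_model(best_threshold), train, "sale")

# Judge the tuned model on rows it has not seen.
test_accuracy = accuracy(make_model(best_threshold), test, "sale")
```

Keeping the held-out rows away from the tuning step is what makes the test accuracy a fair estimate of how the model would do on genuinely new records.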

Improving and Expanding the Predictive Analytics Process

There is not a model out there that cannot be improved. That’s why Predictive Analytics (or what we call “Predictive Intelligence”) is benefiting so much from Machine Learning, Artificial Intelligence, and Deep Learning. Accessing more predictor fields can improve accuracy, although the cost of accessing these fields must be weighed against their contribution to predictive accuracy. More target fields can also be added to the process, each with its own set of predictors and models.

Predictive Analytics has been around for years, but new advances in methods and processing have catapulted it to the forefront. In a recent discussion comparing the popularity of AI to Blockchain, we asked a colleague whether AI was on the “Hype Curve”. He responded that Blockchain was a new concept and technology looking for a problem, whereas Artificial Intelligence, Predictive Analytics, and Machine Learning have been in practice for years. Because of a “Moore’s Law” type effect on these methods, new applications of these technologies are emerging that are life (or game) changing. Anyone for a game of poker?

Author: Patrick Callahan