Select Page

In the last post, we examined different ways to prepare data to counteract known, common problems. Let’s turn our eye towards picking which data to predict.

## Picking Variables

Picking a variable to predict is a blend of both human insight and machine analysis. The best comparison I know is that of a GPS app. We have lots of choices on our smartphones about which mapping application to use as a GPS, such as Apple Maps, Google Maps, and Waze. All three use different techniques, different algorithms to determine the best way to reach a destination.

Regardless of which technology we use, we still need to provide the destination. The GPS will route us to our destination, but if we provide none, then it’s just a map with interesting things around us.

To extend the analogy, we must know the business target we’re modeling. Are we responsible for new lead generation? For eCommerce sales? For happy customers?

## Picking Dependent Variables

Once we know the business target, the metric of greatest overall importance, we must isolate the contributing dependent variables that potentially feed into it. Any number of marketing attribution tools perform this, from Google Analytics built-in attribution modeling to the random forests technique we described in the previous post.

As with many statistical methods, attribution provides us with correlations between different variables, and the first rule of statistics – correlation is not causation – applies. How do we test for correlation?

## Testing Dependencies

Once we’ve determined the dependent variables that show a high correlation to our business outcome, we must construct tests to determine causality. We can approach testing in one of two ways (which are not mutually exclusive – do both!): back-testing and forward-testing.

### Back-Testing

Back-testing uses all our existing historical data and runs probabilistic models on that data. One of the most common ways to do this is with a technique called Markov chains, a form of machine learning.

What this method does is essentially swap in and out variables and data to determine what the impact on the final numbers would be, in the past. Think of it like statistical Jenga – what different combinations of data work and don’t work?

### Forward-Testing

Forward-testing uses software like Google Optimize and other testing suites to set up test variations on our digital properties. If we believe, for example, that traffic from Twitter is a causative contributor to conversions, testing software would let us optimize that stream of traffic. Increases in the effectiveness of Twitter’s audience would then have follow-on effects to conversions if Twitter’s correlation was also causation. No change in conversions downstream from Twitter would indicate that the correlation doesn’t have obvious causative impact.

Once we’ve identified not only the business metric but also the most important dependent variable, we are finally ready to run an actual prediction. Stay tuned in the next post as we take the predictive plunge.

You might also enjoy:

Want to read more like this from Christopher Penn? Get updates here:

Shares