Warning: this content is older than 365 days. It may be out of date and no longer relevant.

The Predictive Analytics Process- Picking Variables

In the last post, we examined different ways to prepare data to counteract known, common problems. Let’s turn our eye towards picking which data to predict.

The Predictive Analytics Process: Picking Variables 1

Picking Variables

Picking a variable to predict is a blend of both human insight and machine analysis. The best comparison I know is that of a GPS app. We have lots of choices on our smartphones about which mapping application to use as a GPS, such as Apple Maps, Google Maps, and Waze. All three use different techniques, different algorithms to determine the best way to reach a destination.

Regardless of which technology we use, we still need to provide the destination. The GPS will route us to our destination, but if we provide none, then it’s just a map with interesting things around us.

To extend the analogy, we must know the business target we’re modeling. Are we responsible for new lead generation? For eCommerce sales? For happy customers?

Picking Dependent Variables

Once we know the business target, the metric of greatest overall importance, we must isolate the contributing dependent variables that potentially feed into it. Any number of marketing attribution tools perform this, from Google Analytics built-in attribution modeling to the random forests technique we described in the previous post.

As with many statistical methods, attribution provides us with correlations between different variables, and the first rule of statistics – correlation is not causation – applies. How do we test for correlation?

Testing Dependencies

Once we’ve determined the dependent variables that show a high correlation to our business outcome, we must construct tests to determine causality. We can approach testing in one of two ways (which are not mutually exclusive – do both!): back-testing and forward-testing.


Back-testing uses all our existing historical data and runs probabilistic models on that data. One of the most common ways to do this is with a technique called Markov chains, a form of machine learning.

markov chain attribution model

What this method does is essentially swap in and out variables and data to determine what the impact on the final numbers would be, in the past. Think of it like statistical Jenga – what different combinations of data work and don’t work?


Forward-testing uses software like Google Optimize and other testing suites to set up test variations on our digital properties. If we believe, for example, that traffic from Twitter is a causative contributor to conversions, testing software would let us optimize that stream of traffic. Increases in the effectiveness of Twitter’s audience would then have follow-on effects to conversions if Twitter’s correlation was also causation. No change in conversions downstream from Twitter would indicate that the correlation doesn’t have obvious causative impact.

Ready to Predict

Once we’ve identified not only the business metric but also the most important dependent variable, we are finally ready to run an actual prediction. Stay tuned in the next post as we take the predictive plunge.

You might also enjoy:

Want to read more like this from Christopher Penn? Get updates here:

subscribe to my newsletter here

AI for Marketers Book
Take my Generative AI for Marketers course!

Analytics for Marketers Discussion Group
Join my Analytics for Marketers Slack Group!


Leave a Reply

Your email address will not be published. Required fields are marked *

Pin It on Pinterest

Share This