Causation can exist without correlation

If you’ve been reading this blog for any amount of time or hanging around me, Tom Webster, Jay Baer, and the many other numbers-focused folks in marketing, you’ve undoubtedly heard the expression “correlation does not equal causation.” This is an axiom of basic statistical analysis, and if for some reason this is the first time you’re hearing it, then please go read this.

One of the assumptions that lots of folks (myself included) make at a certain point in statistics is that while correlation does not equal causation, causation cannot exist without correlation. It turns out that this isn’t true. Causation can exist without correlation!

How is this possible, when a correlation would seem to be mandatory in order for a causal relationship to be present? It’s deceptively simple, and boils down to how you select data. Let’s take a fictitious example: let’s say that I worked for an alcohol company, and I wanted to prove that alcohol does not cause motor vehicle fatalities. For clarity’s sake, neither is true – I don’t work for an alcohol company, and driving while intoxicated is blatantly unsafe. Don’t do it.

If I were to run a correlation on a random, representative sample of people, some of whom drank alcohol and operated a vehicle unsafely and some of whom did not, I would indeed see a strong relationship between alcohol consumption and vehicular fatalities. That would seem to indicate that correlation is mandatory in order for there to be causation.

Screenshot_11_10_14__6_52_AM

But suppose I restricted my “study” to people who were, in my inexpert opinion, most likely to drive drunk. Suppose I focused it only on people who had 10 or more drinks per day. What I might find is a negative correlation: the more you drink, the less likely it is you’ll die from drunk driving, and therefore driving while drunk must be safe. What’s really happening among that population of super-heavy drinkers? They’re likely dying of causes other than drunk driving. At 10+ drinks a day, that’s not too hard to imagine.

The reality is that by selecting a population with too little variation – that is, no one in the study did NOT drink – you can create distortions in your data that appear to “prove” your point, even though the conclusion is statistically invalid. We know, beyond a shadow of a doubt, that drinking alcohol does cause an increase in vehicular deaths, but the data can be manipulated to “prove” otherwise.
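To see how restricting a sample can hide a relationship, here is a minimal simulation sketch. Everything in it is made up for illustration: the drinks-per-day range, the "risk score," and the noise level are assumptions, not real data. It also doesn’t model heavy drinkers dying of other causes, so it shows the correlation weakening rather than flipping negative. The causal link is identical for everyone in the simulation; only the slice of people we look at changes.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: 0 to 15 drinks per day, and a made-up
# "risk score" that rises with drinking, plus random noise.
drinks = [random.uniform(0, 15) for _ in range(5000)]
risk = [d + random.gauss(0, 3) for d in drinks]

# Correlation across everyone, light and heavy drinkers alike.
full_r = pearson(drinks, risk)

# Correlation after keeping only people at 10+ drinks per day.
heavy = [(d, r) for d, r in zip(drinks, risk) if d >= 10]
heavy_r = pearson([d for d, _ in heavy], [r for _, r in heavy])

print(f"everyone:       r = {full_r:.2f}")
print(f"10+ drinks/day: r = {heavy_r:.2f}")
```

The second number should come out noticeably smaller than the first, even though drinking raises risk by exactly the same amount in both groups.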

While the above is an extreme example, there are plenty of times marketers make this mistake. Any time you do a survey or study of your customers, you are automatically reducing variation. You’re not surveying people who are NOT your customers. While surveying only your customers makes a great deal of sense if you want to understand how customers feel about your products or services, surveying only your customers to get a sense of the industry can create the same distortions as the alcohol and drunk driving example above. You’re only “proving” that your data has insufficient variation, and that there may be a very obvious causal relationship that you’re missing entirely.

Keep this example in mind as you read through surveys, infographics, etc. in the coming months. There will be a great deal of “marketers believe in 2015” or “marketers found in 2014” headlines – but check to see how the survey was taken. If it’s a survey of customers or someone’s email list, question the daylights out of it before you go believing it and making any changes to your business.


If you enjoyed this, please share it with your network!


Want to read more like this from ? Get daily updates now:


Get my book!

Subscribe to my free newsletter!


Hidden analytics traps: percent change

Quick, take a look at this performance chart of percent change in your analytics:

Screenshot_11_3_14__6_38_AM

Now tell me, is the person responsible for this getting fired?

Obviously, based on the title of this post, you might be a little more cautious about how you answer that question – but the average manager, director, VP, or C-suite executive might not be.

Okay, second performance chart for you to take a look at:

Screenshot_11_3_14__6_43_AM

So, what do you think? Is the person in charge of revenue here getting fired or promoted?

If you’re a rational business leader, the up-and-to-the-right nature of this graph obviously says that the person in charge of it is doing a good job.

Now…

What_if_I_told_you_They_were_the_same_data__-_What_if_I_told_you___Matrix_Morpheus___Meme_Generator

This is the hidden danger of percentage change calculations. They’re useful for understanding how much something has grown, but they can be skewed significantly when the change is big relative to the starting value. The difference between 1,000 and 1,001 is the same in absolute terms as the difference between 0 and 1, but the first is a 0.1% increase, while the second can’t be expressed as a percentage at all because it means dividing by zero. It is, in effect, an infinitely bigger jump.

This is why you need to look at absolute data whenever you’re looking at percentage change data. It doesn’t matter whether you’re talking about Twitter followers, lead generation, ROI, or company revenue – make this a standard practice. If a vendor, supplier, subordinate, or peer comes to you with only percentage change data, ask, with vigor and confidence, to see the underlying data as well. Otherwise you may be getting only part of the story (and likely the part of the story that makes them look good).
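Here is a small sketch of how one series can tell two different stories. The revenue figures are hypothetical, chosen only to make the arithmetic easy to follow, and the function names are my own. Revenue climbs by the same amount every month, yet the percent change falls every month.

```python
def percent_changes(values):
    """Period-over-period percent change; None where the baseline is zero."""
    changes = []
    for previous, current in zip(values, values[1:]):
        if previous == 0:
            changes.append(None)  # percent change from zero is undefined
        else:
            changes.append((current - previous) / previous * 100)
    return changes

def absolute_changes(values):
    """Period-over-period change in the original units."""
    return [current - previous for previous, current in zip(values, values[1:])]

# Hypothetical monthly revenue: steady growth of 100 per month.
revenue = [100, 200, 300, 400, 500]

print(absolute_changes(revenue))                        # [100, 100, 100, 100]
print([round(p, 1) for p in percent_changes(revenue)])  # [100.0, 50.0, 33.3, 25.0]
```

Chart the second list alone and it looks like a business in decline. Chart the raw revenue and it goes up and to the right.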




Average, median, and marketing analytics

If you’ve never taken a statistics class, yet you’ve ended up being responsible for your company’s marketing analytics, then this blog post is for you.

One of the core statistical concepts we rarely hear about is the median. We hear about averages all of the time: average revenue per user, average website traffic, average number of new followers gained. But here’s the thing about averages – and any statistic, for that matter – sometimes they don’t tell the whole story.

About the only time the average person even hears the word median (besides when they drive in it on the highway) is from politicians when they talk about median income.

Broadly defined, an average is when you take the sum of all of the numbers in a data set and divide by the number of things in the data set to look for a central value. For example, let’s take the whole numbers from 10 to 20. There are 11 items in the data set: 10, 11, 12, 13, and so on up to 20. Add up all the numbers and you get 165, then divide by 11, and you get the average, 15.

Broadly defined, a median is the middlemost number in a data set once the values are sorted. In the same example data set above, the median is also 15. It’s right in the middle, with five numbers below it and five above.
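If you’d rather check the 10-to-20 example than take my word for it, Python’s built-in statistics module will do both calculations. This is just a quick sketch, and the variable name is my own.

```python
import statistics

numbers = list(range(10, 21))  # the whole numbers 10 through 20

print(len(numbers))                # how many items are in the data set
print(sum(numbers))                # the total before dividing
print(statistics.mean(numbers))    # the average
print(statistics.median(numbers))  # the middlemost value
```

When the data is evenly spread like this, the average and the median land on the same number.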

Here’s how an average can mislead, using a very often-cited example. Imagine you’re in a bar with 10 of your friends. Let’s call the average income in the bar $50,000. The median income is also $50,000. Now Bill Gates walks in. The average income in the bar skyrockets to $5 million. Is everyone in the bar richer? Should the bar change its pricing because the average income of the patrons is so much higher? Of course not. The median income stays the same, but the average gets skewed because of an outlier.
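The bar example is easy to reproduce. In this sketch, every income is hypothetical: I’ve assumed all 11 people (you plus 10 friends) earn exactly $50,000, and the newcomer’s income is a made-up figure picked only so the new average comes out to the $5 million mentioned above. It is not a real number for anyone.

```python
import statistics

# Hypothetical: you plus 10 friends, each earning $50,000.
bar = [50_000] * 11

mean_before = statistics.mean(bar)
median_before = statistics.median(bar)

# Hypothetical outlier income, chosen so the new average is $5 million.
outlier_income = 59_450_000
bar_with_outlier = bar + [outlier_income]

mean_after = statistics.mean(bar_with_outlier)
median_after = statistics.median(bar_with_outlier)

print(f"before: average ${mean_before:,.0f}, median ${median_before:,.0f}")
print(f"after:  average ${mean_after:,.0f}, median ${median_after:,.0f}")
```

One new patron moves the average a hundredfold, while the median doesn’t budge.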

Here’s the thing: digital marketing is FILLED with outliers. If we want to measure accurately, we have to deal with them – and that’s why median is important. Medians help to give a second perspective on the same data, one that can sometimes deal with outliers a little better.

Let’s look at this chart of my personal website’s analytics, focusing on the month of October.

Audience_Overview_-_Google_Analytics

If we do the math, the average traffic on my site this month is 410 visitors a day.

Let’s chart that out. Does that look right to you?

Screenshot_10_23_14__7_30_AM

It doesn’t to me. There are more parts of the blue line below the red than above the red, and if an average is supposed to help me find the middle, it’s not necessarily doing the best job in this case.

Now what if we put the median on here, which is 393 visitors a day:

Screenshot_10_23_14__7_33_AM

There is a difference. That big spike drove up the average, but the median remained relatively resistant to it.

If I’m trying to budget for personnel, for advertising, for anything that relies on web traffic, which number should I plan around? I’d use the median, because it’s more representative of the typical day on my website than the average, in this case.

Keep the median in your toolbox, and when you’re doing analysis and reporting on any series of data in marketing that calls for an average, calculate the median at the same time. It may shine some light on what’s going on in your data.

