You Ask, I Answer: Statistical Significance in A/B Testing?

Warning: this content is older than 365 days. It may be out of date and no longer relevant.

Wanda asks, “How do I know if my A/B test is statistically significant?”

Statistical significance requires understanding two important things: first, is there a difference that’s meaningful (as opposed to random noise) in your results, and second, is your result set large enough? Watch the video for a short walkthrough.

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

In today’s episode, Wanda asks, “How do I know if my A/B test is statistically significant?” This is a very good, very common question, particularly with things like web page optimization, email marketing, and even social media marketing.

What happens is we’ll get results back. In fact, let’s look at an example. This is an A/B test I did yesterday.

What we see here is that I sent out an email with two different subject lines. Subject line A was “Eight steps to improving your LinkedIn profile,” and B was a question: “Is your LinkedIn profile working for you?” And we see in my marketing automation software here that A has been marked as the winner, right? Let’s look at the stats.

When we look at the stats, we see A was sent to 39,574 people and B was sent to 39,573 people.

So far so good.

A got 3,990 opens, which is what I was measuring, and B got 3,873 opens.

So A is the winner, or is it? This is a very common scenario.

To answer Wanda’s question, we need to do some statistical testing. We need to do what is called a proportion test, specifically a two-sided test.

And then we need to do a power test to figure out whether our sample here is large enough.

Some basic stats: what we’re talking about with these statistical significance tests is whether there is enough of a difference between A and B that it could not plausibly have happened randomly, by chance.

Is there enough of a difference in the audience that you could measure it and say, yes, this is not chance, this did not happen by accident, there was a real impact?

Or could this have been noise? Is the difference so small that it could have been random?

That’s really what we want to find out.

Because if we want to make a judgment about whether subject line A or B is better, we need to know if A, which in this case is marked as the winner, really won or whether it was luck of the draw.

There are a number of different ways you can tackle this in every math and stats program available. I’m going to use the programming language R here.

There are even web calculators for some of this; I just like to use R because it’s super compact.

I have my A population, which is the number of people I sent it to, and the number of opens that A got.

I’ve got my B population here and its conversions, and I’m going to run that proportion test.

What I’m looking for is this number right here, the p-value. A p-value of under 0.05 means that a difference this large would be unlikely to show up by chance alone.

In other words, there’s a big enough difference between the two that something meaningful has happened.

Here, it’s above 0.05. It’s at 0.164.

So these two audiences may well have behaved the same, which means that A didn’t necessarily win.
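
The transcript does not show the R code itself, so here is a minimal sketch of the same kind of check in Python, using only the standard library. It is a plain two-sided z-test on two proportions with a pooled standard error; the function name `two_proportion_z_test` is made up for this illustration. Statistics packages often apply a continuity correction by default, so their p-value can differ slightly from this one.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Illustrative helper (not from the video): pooled standard error,
    no continuity correction.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided: probability of a gap at least this large in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Sends and opens from the email test described above.
z, p_value = two_proportion_z_test(3990, 39574, 3873, 39573)
print(f"z = {z:.3f}, p-value = {p_value:.3f}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```

With the numbers from this email test, the p-value lands well above 0.05, which agrees with the conclusion that A did not clearly win.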

Now, just to show you an example, what if I take B down to 3,400 conversions? If I run that exact same test, we see that the p-value shrinks to an infinitesimally small number, on the order of e minus 10.

That’s roughly 10 zeros after the decimal point.

That is a clear difference; the result was not random luck. But what if we crank B up to 3900 and make it super close? Watch what happens.

0.851. That p-value has gone up even higher.

And so just with this very simple mathematical test, we can determine that in this case, the test itself was not statistically significant.

Now, here’s the other catch.

One of the things that goes wrong with a lot of A/B tests, particularly with social media marketing, is that there’s not enough of a result to know.

In this test, we see that about 10% of people opened the email in each category.

Is that big enough? Is that a large enough slice of the audience to tell?

To find out, we’re going to run this power test.

The power test says that out of 3,900 people, in order to detect a minor measurable effect of some kind, I would need to have at least 200 people take action, which is that n2 number there.

If I did this test and 39 people clicked on A and 38 people clicked on B, would that be enough to judge whether there was a winner? The answer is no, because not enough people have been sampled to make that determination.

I need to have at least, call it 200 rounded, 200 people in order to know that yes, this is a real, significant value.
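
The settings behind that power test are not stated in the transcript, so the following is only a sketch under assumptions: a two-sided test at a 0.05 significance level, 80% power, and a "small" effect of Cohen's h = 0.2. The helper names `cohens_h` and `required_n2` are invented for this example, and the calculation uses the usual normal approximation, so it may not match the exact figure shown in the video.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist


def cohens_h(p1, p2):
    """Effect size for two proportions (arcsine transformation)."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


def required_n2(h, n1, alpha=0.05, power=0.80):
    """Size the second group needs, given the first group's size n1.

    Assumed defaults (not from the video): alpha = 0.05, power = 0.80.
    Uses the normal approximation for a two-sided test.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # The test needs n1 * n2 / (n1 + n2) to reach this value.
    needed = ((z_alpha + z_power) / h) ** 2
    if needed >= n1:
        raise ValueError("n1 is too small for this effect size and power")
    return ceil(needed * n1 / (n1 - needed))


# How big is the gap between a 10% and a 12% open rate, as an effect size?
print(f"h for 12% vs 10%: {cohens_h(0.12, 0.10):.3f}")

# Assumed small effect (h = 0.2) with 3,900 people in the first group.
print("n2 needed for h = 0.2:", required_n2(0.2, 3900))
# A smaller effect is harder to detect and needs a larger second group.
print("n2 needed for h = 0.1:", required_n2(0.1, 3900))
```

The point of the sketch is the direction of the relationship: the smaller the effect you want to detect, the more people you need before a result means anything.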

This is really important, because something we don’t talk about a lot is that smaller populations need proportionally bigger samples.

Let’s say that I want to talk about the Fortune 500. How many CEOs in the Fortune 500 do I need to survey in order to get a meaningful result? 322 of them, because it’s such a small population that variation in just a few people could really throw things off.

So in this case, I would have to survey basically 60% of this very small population to know that, yes, there’s a real thing here. The larger the population gets, assuming it’s well sampled, the smaller my sample needs to be relative to that population in order to get a statistically meaningful result.

Because again, small variations in a very small population can cause really big changes, as opposed to a bigger population, where you’re going to have a more evenly distributed result.
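
To see how the required share of a population shrinks as the population grows, here is a rough sketch using the standard survey sample size formula with a finite population correction. The 95% confidence level and 5% margin of error are assumptions made for this illustration; the video does not state the settings behind the 322 figure, so this sketch will not reproduce that number. The function name `survey_sample_size` is also hypothetical.

```python
from math import ceil
from statistics import NormalDist


def survey_sample_size(population, margin=0.05, confidence=0.95, p=0.5):
    """Sample size for estimating a proportion in a finite population.

    Assumed defaults (not from the video): 5% margin of error,
    95% confidence, and p = 0.5 as the most conservative guess.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size if the population were effectively unlimited.
    n0 = z**2 * p * (1 - p) / margin**2
    # Finite population correction shrinks it for small populations.
    return ceil(n0 / (1 + (n0 - 1) / population))


for population in (500, 5_000, 50_000):
    n = survey_sample_size(population)
    print(f"population {population:>6}: survey {n} ({n / population:.1%})")
```

Under these assumptions, a population of 500 needs a large fraction of its members surveyed, while a population of 50,000 needs well under 1%.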

My friend Tom Webster likes to compare this to soup. In a large population, if the pot is stirred well enough, a spoonful can tell you all you need to know about the soup.

But if you’ve got a gumbo or a stew, one spoonful might have a huge chunk of beef in it, and you would draw the conclusion that this pot is full of beef.

Well, no, it’s not. You just happened to have a very lumpy sample there.

And because the pot is smaller, those lumps can be more misleading about the composition of the entire soup pot.

So these are the two tests you need to run.

And again, there are plenty of web calculators out there that do this stuff.

The challenge is that a lot of them don’t do the second part. They don’t do the power test to determine whether your sample was big enough in the first place; they just do the first part.

So know that.

If you can use the programming language R, or SPSS, or SAS, or Stata, or any of these stats tools, do so, because you will get better answers out of them as long as you know how to interpret what you’re seeing.

But that’s how you know if your test is statistically significant: a big enough sample and a meaningful enough difference.

If you have follow up questions about this or anything else, please leave them in the comments box below.

Subscribe to the YouTube channel and the newsletter.

I’ll talk to you soon.

Take care.

Want help solving your company’s data analytics and digital marketing problems? Visit TrustInsights.ai today and let us know how we can help you.
