
You Ask, I Answer: Detecting Bias in Third Party Datasets?

Jim asks, “Are there any resources that evaluate marketing platforms on the basis of how much racial and gender bias is inherent in digital ad platforms?”

Not that I know of, mostly because in order to make that determination, you’d need access to the underlying data. What you can do is validate whether your particular audience has a bias in it, using collected first party data.

If you’d like to learn more on the topic, take my course on Bias in AI at the Marketing AI Academy.


Can’t see anything? Watch it on YouTube here.

Listen to the audio here:

Download the MP3 audio here.

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

In today’s episode, Jim asks, “Are there any resources that evaluate marketing platforms on the basis of how much racial and gender bias is inherent in digital ad platforms?” Not that I know of, mostly because in order to make a determination about the bias of a platform, you need to look at three different things: the data set that’s gone into it, the algorithms that have been chosen to run against that data, and ultimately, the model that these machine learning platforms use in order to generate results.

And no surprise, the big players like Facebook or Google or whatever, have little to no interest in sharing their underlying data sets because that literally is the secret sauce.

Their data is what gives their machine learning models value.

So what do you do if you are concerned that the platforms you’re dealing with may have bias of some kind in them? Well, first, acknowledge that they absolutely have bias, because they are trained on human data, and humans have biases.

For the purposes of this discussion, let’s focus on the machine definition of bias, because there are a lot of human definitions.

The machine or statistical definition is that a bias exists if something is calculated in a way that is systematically different than the population being estimated. So if you have a population, for example, that is 50/50, and your data set is 60/40 on some statistic, you have a bias: it is systematically different than the population you’re looking at.
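If you want to check this computationally, here is a minimal sketch in Python, assuming you have SciPy installed and using the made-up 50/50 versus 60/40 numbers from the example above. A chi-square goodness-of-fit test asks whether an observed split is systematically different from the population share:

```python
from scipy.stats import chisquare

sample_counts = [600, 400]       # a 60/40 split in a sample of 1,000
population_shares = [0.5, 0.5]   # the 50/50 population being estimated

n = sum(sample_counts)
expected = [share * n for share in population_shares]  # [500, 500]

stat, p_value = chisquare(f_obs=sample_counts, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.3g}")
# A very small p-value means the sample is systematically different from
# the population it is supposed to represent, i.e., biased in the
# statistical sense.
```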

Now, there are some biases where that’s fine, because they’re not what are called protected classes.

If you happen to cater to, say, people who own Tesla cars: not everybody in the population has a Tesla car, and so if your database is unusually overweight in that aspect, that’s okay. That is a bias, but it is not one that is protected.

This is actually a lovely list here of what are considered protected classes: race, creed or religion, national origin, ancestry, gender, age, physical and mental disability, veteran status, genetic information, and citizenship.

These are the things that are protected against bias legally in the United States of America.

Now, your laws in your country may differ depending on where you are.

But these are the ones that are protected in the US.

And because companies like Facebook and Google are predominantly US-based, headquartered here, with a lot of their data science teams located in the United States, these are, at a minimum, the things that should be protected.

Again, your country or your locality (the EU, for example) may have additional things that are also prohibited.

So what do we do with this information? How do we determine if we’re dealing with some kind of bias? Well, there are some easy tools to get started with, knowing that these are some of the characteristics.

Let’s take Facebook, for example. Facebook’s Audience Insights tells us a lot about who our audience is.

So there are some basic characteristics.

Let’s go ahead and bring this up here.

This is people who are connected to my personal Facebook page, looking at age and gender, relationship status, and education level.

Remember that things like relationship status and education level are not protected classes, but it still might be good to know whether there is a bias, whether my data set is statistically different than the underlying data.

So here we see, for example, that in my data set I have zero percent males between the ages of 25 and 34, whereas in the general population that’s going to be around 45%, give or take. We see that in the 45 to 54 bracket, I am 50% of that group, so there’s a definite bias towards men there. There is a bias towards women in the 35 to 44 set, and a bias towards women in the 55 to 64 set. So you can see in this data that there are differences from the underlying all-Facebook population. This tells me that there is a bias in my page’s data. Now, is that meaningful? Maybe. Is that something that I should be calibrating my marketing on? No, because again, gender and age are protected classes.
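To illustrate the kind of comparison I’m describing, here is a hypothetical sketch: the bracket percentages are made up for illustration (the kind of numbers you would read manually out of a tool like Audience Insights), not actual platform data. It lines up your audience’s brackets against the platform baseline and flags the drift:

```python
# Hypothetical audience-vs-baseline comparison; all percentages are
# illustrative only, not real Audience Insights output.
audience = {           # % of my page's audience in each gender/age bracket
    ("male", "25-34"): 0.0,
    ("male", "45-54"): 50.0,
    ("female", "35-44"): 14.0,
    ("female", "55-64"): 9.0,
}
baseline = {           # % of the overall platform population, same brackets
    ("male", "25-34"): 45.0,
    ("male", "45-54"): 20.0,
    ("female", "35-44"): 11.0,
    ("female", "55-64"): 5.0,
}

# Flag any bracket that drifts 5 or more percentage points from baseline.
for (gender, ages), base_pct in baseline.items():
    mine = audience.get((gender, ages), 0.0)
    diff = mine - base_pct
    flag = "  <-- drift" if abs(diff) >= 5.0 else ""
    print(f"{gender} {ages}: audience {mine:.0f}% vs "
          f"baseline {base_pct:.0f}% ({diff:+.0f} pts){flag}")
```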

And I probably should not be creating content or doing things that could leverage one of these protected classes in a way that is illegal.

Now, that said, if your product or service is aimed at a specific demographic, say I sold, I don’t know, wrenches: statistically, there are probably going to be more men in general who would be interested in wrenches than women. Not totally, but enough that there would be a difference.

In that case, I’d want to look at the underlying population and see if I could calibrate against the interest category, not the Facebook population as a whole, but the category that I’m in, to make sure that I’m behaving in a way that is representative of the population from a data perspective.

This data exists.

It’s not just Facebook.

So this is from IPUMS (I can’t remember what IPUMS stands for) at the University of Minnesota.

They ingest population data from the US Census Bureau’s Current Population Survey. It’s microdata that comes out every month.

And one of the things you can do is go in and use their little shopping tool to pull out all sorts of age and demographic variables, including industry, what you earn, and class of worker, and you can use this information.

It’s anonymized, so you’re not going to violate anyone’s personally identifiable information.

And what you would do is extract the information from here (it’s free), look at your industry, and get a sense for things like age, gender, race, marital status, veteran status, and disability: for your industry, get a sense of what the population is.
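As a rough sketch of what working with that extract might look like, assuming you’ve downloaded a CPS extract from IPUMS as a CSV: the variable names here (SEX, IND, WTFINL) follow IPUMS conventions, but verify them against the codebook that ships with your extract, and the industry code is a placeholder, not a real lookup.

```python
import pandas as pd

# Load a CPS extract downloaded from IPUMS as CSV.
cps = pd.read_csv("cps_extract.csv")

# Filter to your industry using the census industry code; find the real
# code for your industry in the extract's codebook.
MY_INDUSTRY_CODE = 7380  # placeholder value
industry = cps[cps["IND"] == MY_INDUSTRY_CODE]

# Weighted share of each gender in that industry. WTFINL is the person
# weight, so weighted sums estimate the real-world population rather
# than just counting survey respondents.
gender_share = (industry.groupby("SEX")["WTFINL"].sum()
                / industry["WTFINL"].sum())
print(gender_share)
```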

Now, you can and should make an argument that there will be some industries where there is a substantial skew already from the general population. For example, programming skews unusually heavily male. This is for a variety of reasons we’re not going to go into right now, but acknowledge that that’s a thing.

And so one of the things you have to do when you’re evaluating this data and then making decisions on it is ask: is the skew acceptable, and is the skew protected? So in the case of, for example, marital status: marital status is not a protected class.

So if your database skews one way or the other, does it matter? Probably not.

Is it material to your business? For what we sell (Trust Insights sells marketing insights, for example), it’s completely immaterial, so we can just ignore it.

If you sell things like, say, wedding bands, marital status might be something you’d want to know, because there’s a good chance it matters for some of your customers. Not everybody goes and buys new rings all the time; typically, it’s a purchase that happens very, very early on in a long-lasting marriage.

On the other hand, age, gender, and race: those are absolutely protected classes.

So you want to see: is there a skew in your industry compared to the general population, and is that skew acceptable? If you are hiring, that skew is not acceptable. You cannot hire for a specific race; not allowed. You cannot hire for a specific age; not allowed.

So a lot of this understanding will help you calibrate your data.

Once you have the data from the CPS, you would then take it and look at your first party data, like your CRM software or your marketing automation software, if you have that information. And if you do, then you can start to make the analysis.

Is my data different than our target population, which is the group we’re drawing from? Is that allowed? And is it materially harmful in some way? So that’s how I would approach this.
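Here is a hedged sketch of that final analysis in Python; the attribute names, thresholds, and numbers are all hypothetical, and the protected-class set is the US list discussed above:

```python
# Hypothetical helper: compare each attribute's distribution in your
# first-party (CRM) data to the benchmark population, flag skews, and
# note whether the attribute is a protected class.
PROTECTED = {"gender", "age", "race", "veteran_status", "disability"}

def check_drift(attribute, crm_shares, benchmark_shares, threshold=5.0):
    """Print any category that drifts >= threshold percentage points."""
    for category, bench_pct in benchmark_shares.items():
        diff = crm_shares.get(category, 0.0) - bench_pct
        if abs(diff) >= threshold:
            status = ("protected class: review with counsel"
                      if attribute in PROTECTED else "not a protected class")
            print(f"{attribute}/{category}: {diff:+.1f} pts "
                  f"vs benchmark ({status})")

# Made-up example shares, expressed as percentages.
check_drift("gender",
            crm_shares={"male": 70.0, "female": 30.0},
            benchmark_shares={"male": 55.0, "female": 45.0})
check_drift("marital_status",
            crm_shares={"married": 48.0, "single": 52.0},
            benchmark_shares={"married": 50.0, "single": 50.0})
```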

It’s a big project, and it is a project that you have to approach very, very carefully, and with legal counsel, I would say. If you suspect that you have a bias, and that that bias may be materially harmful to your audience, you should approach it with legal counsel so that you protect yourself, you protect your customers, you protect the audience you serve, and you make sure you’re doing things the right way.

I am not a lawyer.

So good question.

We could spend a whole lot of time on this; there’s a lot to unpack here, but this is a good place to start.

Start with Current Population Survey data.

Start with the data that these tools give you already, and look for drift between your population and the population you’re sampling from. If you have follow-up questions, leave them in the comments box below.

Subscribe to the YouTube channel and the newsletter, and I’ll talk to you soon. Take care.

Want help solving your company’s data analytics and digital marketing problems? Visit TrustInsights.ai today and let us know how we can help you.

