Warning: this content is older than 365 days. It may be out of date and no longer relevant.

You Ask, I Answer: Third Party Data and Trustworthiness?

Ashley asks, “If you choose to use public datasets for your ML models, like from Amazon or Google, can you trust that those are free of bias?”

Can you trust a nutrition label on a food product? The analogy is the same. What’s in the box is important, but what went into the box is also important. Trust is also proportional to risk.

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

In today’s episode, Ashley asks, If you choose to use public datasets for your machine learning models, like from Amazon or Google, can you trust that those are free of bias? Hmm.

Well, so there’s a couple different things here.

Companies like Amazon and Google don’t typically offer datasets.

What they do offer are either models or APIs of some kind.

So Amazon, for example, has Rekognition and SageMaker and all these things.

These have APIs behind them.

They have pretrained models.

Many of the services in Google Cloud work the same way.

For example, Google Speech-to-Text, things like that.

Google also does release actual models themselves, like the T5 transformer, which you can install into something like Google Colab or your local Python environment to use their pretrained models.

And then, yes, for example, Google Datasets does offer raw data.

Now, let’s talk about that.

Can you trust that these are free of bias? In a word? No.

You cannot blindly trust anyone’s machine learning models or data to be free of bias, because you don’t know what’s in them.

So as an analogy, suppose that you have a jar of jalapenos, right? It has a nutrition label that says five calories per serving.

And what would I expect to see in this jar? Jalapenos, right, as an ingredient, and probably vinegar and water, maybe some salt.

That’s what’s in here.

Can I trust, if I look at the label alone, that that’s what I’m getting? Well, when I look at this, I see jalapenos, water, vinegar, salt, dehydrated onions, dehydrated garlic, calcium chloride as a firming agent, sodium benzoate as a preservative, polysorbate 80, and turmeric for color.

Why is polysorbate 80 in here? You don’t need an emulsifier for peppers in a jar anyway.

Can I trust that what’s on the label is accurate?

For example, with the jalapenos: where were they grown? Were they grown free of harmful pesticides?

In this case, this jar is not labeled organic, so probably not.

On the other hand, if you were in the EU and you had this exact same product, could you trust that it was free of pesticides? Yes, much more so because EU regulations for foods are much more stringent than the United States.

The same analogy applies to machine learning and data science.

What the model says is important, but what went into making the model is just as important: it has to be free of both kinds of bias, human and statistical.

There are, for example, any number of cases of bias that was unintentional.

Somebody did not mean for the dataset to be biased, or did not mean for their model to be biased, but it was, because they didn’t do any due diligence when putting it together.

Probably the most famous case of this is Amazon, when it attempted to build a hiring AI to screen resumes.

They trained it.

They weren’t looking for bias, and the model stopped hiring women.

Right? Because nobody did any checks.

So what’s the solution? Can you build from these systems and trust them? Well, there are two different ways to handle this.

The first is to build your own model, which is expensive and time consuming, but it is the only guarantee that the data going into it is trustworthy because you will have vetted it and made it trustworthy and tested it.

If you are somewhat familiar with Python, IBM has the AI Fairness 360 toolkit, which is a phenomenal toolkit, totally free, to test datasets for bias.

And if you are building your own model, you would use that to validate your data before the model is constructed.
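To make "validate your data before the model is constructed" a little more concrete, here is a minimal sketch in plain Python of two group-fairness measures, disparate impact and statistical parity difference. AI Fairness 360 offers dataset metrics of this kind, but this sketch does not use the toolkit itself. The toy resume data, the field names, and the helper functions are all hypothetical, made up purely for illustration.

```python
from collections import defaultdict


def selection_rates(rows, group_key, label_key, favorable=1):
    """Return the share of favorable labels within each group."""
    totals = defaultdict(int)
    favorable_counts = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row[label_key] == favorable:
            favorable_counts[group] += 1
    return {group: favorable_counts[group] / totals[group] for group in totals}


def disparate_impact(rates, unprivileged, privileged):
    """Ratio of favorable-outcome rates (unprivileged / privileged).

    A value of 1.0 means both groups receive the favorable label equally often.
    Raises ZeroDivisionError if the privileged group's rate is zero.
    """
    return rates[unprivileged] / rates[privileged]


def statistical_parity_difference(rates, unprivileged, privileged):
    """Difference of favorable-outcome rates (unprivileged - privileged)."""
    return rates[unprivileged] - rates[privileged]


# Hypothetical toy training data: 10 resumes per group with a "hired" label.
resumes = (
    [{"gender": "F", "hired": 1}] * 2
    + [{"gender": "F", "hired": 0}] * 8
    + [{"gender": "M", "hired": 1}] * 6
    + [{"gender": "M", "hired": 0}] * 4
)

rates = selection_rates(resumes, "gender", "hired")
di = disparate_impact(rates, unprivileged="F", privileged="M")
spd = statistical_parity_difference(rates, unprivileged="F", privileged="M")

print(f"selection rates: {rates}")
print(f"disparate impact: {di:.2f}")
print(f"statistical parity difference: {spd:.2f}")
```

In this made-up data, women are labeled "hired" a third as often as men. A commonly cited rule of thumb treats a disparate impact ratio below 0.8 as a warning sign, though the right threshold depends on your own situation and risk level. A model trained on labels like these would likely learn the same skew.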

Then you can be reasonably sure that, at least, the data going into your model is not biased. You do still have to monitor it; you still have to have, for example, the protected classes that you’re monitoring for declared.

And you still do have to ensure that the model, when it’s running, is not drifting outside the guardrails that you set for it.

For example, if you said that gender must be a 50/50 split, or a 40/40/20 split, then you would have to monitor and say, okay, how far outside is acceptable? Is a 1% drift acceptable? Is a 5% drift acceptable? At what point do you say, hey, we need to either roll the model back and retrain it, or balance it in some way to get it back on the rails?
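The guardrail idea could be sketched roughly like this: declare the target split, pick a tolerance, and flag any category whose observed share drifts beyond it. The category names, the 40/40/20 targets, the 5-point tolerance, and the function names are all assumptions for illustration; what counts as acceptable drift is a judgment call you would make for your own model.

```python
from collections import Counter


def share_drift(observations, targets):
    """Return observed share minus target share for each declared category."""
    if not observations:
        raise ValueError("need at least one observation to measure drift")
    counts = Counter(observations)
    total = len(observations)
    return {
        category: counts.get(category, 0) / total - target
        for category, target in targets.items()
    }


def out_of_bounds(drift, tolerance):
    """Keep only the categories whose absolute drift exceeds the tolerance."""
    return {category: d for category, d in drift.items() if abs(d) > tolerance}


# Hypothetical guardrails: a 40/40/20 split with 5 percentage points of slack.
TARGETS = {"group_a": 0.40, "group_b": 0.40, "group_c": 0.20}
TOLERANCE = 0.05

# Hypothetical batch of recent model outputs, one category per decision.
recent = ["group_a"] * 30 + ["group_b"] * 50 + ["group_c"] * 20

drift = share_drift(recent, TARGETS)
breaches = out_of_bounds(drift, TOLERANCE)

if breaches:
    print(f"Outside guardrails, review or retrain: {sorted(breaches)}")
else:
    print("Within guardrails")
```

Here group_a is about 10 points under target and group_b about 10 points over, so both are flagged. Running a check like this on a schedule is one simple way to notice when a model has gone off the rails.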

So that’s one aspect: you have to build it yourself, train it, and monitor it. The second is a risk judgment.

Trusting a model is proportional to the risk that you’re incurring with the model.

So, if I am building a machine learning model to recognize sentiment in tweets, how vitally important is that going to be? No one’s probably going to die.

If I’m using it for like social media engagement monitoring, probably nobody’s going to die.

I might make some bad judgment calls, I could cause some damage to a brand.

But for the most part, it’s not super serious.

On the other hand, if I am producing, say, a new vaccine, it had better be really, really, really unbiased. It had better be really representative. Any model I build to try to assess the efficacy of something or identify a drug candidate had better be pristine in its freedom from bias, because it could actually kill people, right? The risk level is substantially higher.

So the standards that we must hold that model to are much more stringent. Facial recognition at, say, a tradeshow booth is relatively low risk, right? If you misidentify somebody as a gimmick to attract people to your tradeshow booth, that’s not huge. Facial identification being misused by police is a big deal, a life-threatening deal.

So you’d better make sure that that model is properly trained and unbiased.

So that’s how to evaluate a lot of these models, datasets, pretrained models, and APIs from major vendors: what is the level of risk, and what are the consequences if it gets it wrong?

Bear in mind that an awful lot of machine learning models are biased, especially in facial recognition and in natural language processing.

Natural language processing has a lot of hidden biases, the most obvious of which is that most models are trained on the English language, and English is, I forget who said it, a language of privilege.

It is the language of the wealthier part of the world.

It is not the majority language in the world.

And there are many, many, many, many billions of people who speak other languages.

And many of our machine learning models are not well suited to recognizing or processing those languages.

And if you think some of the things that AI does with English are hilarious, you should see what they do to other languages.

When you give that some consideration, and think about who speaks English, what race they are, what gender they are, what income level they are, what ethnicity they are, and what religion they are, you can see how even something as simple as using the English language could introduce biases into your models.

So keep that in mind.

It’s all about trust and risk.

How much trust do you need in the model? How high is the risk? That dictates whether you should be training your own versus using a third party’s.

If you have follow up questions, leave them in the comments box below.

Subscribe to the YouTube channel and the newsletter. I’ll talk to you soon. Take care.

Want help solving your company’s data analytics and digital marketing problems? Visit TrustInsights.ai today and let us know how we can help you.


Want to read more like this from Christopher Penn? Get updates here:

subscribe to my newsletter here

