Dave asks, “What are some of the types of bias to be aware of in machine learning?”
This is a powerful and important question today. As we give more power to our machines, we need to understand how they’re making decisions. Watch the video to learn the four major categories of machine learning bias to look for, and ways to address them.
- Got a question for You Ask, I’ll Answer? Submit it here!
- Subscribe to my weekly newsletter for more useful marketing tips.
- Find older episodes of You Ask, I Answer on my YouTube channel.
- Need help with your company’s data and analytics? Let me know!
What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.
In today’s episode, Dave asks, what are some of the major types of machine learning bias? This is a really, really important question. As we give more control to our machines, as we let them make more decisions in our everyday lives, we need to understand how those machines are making decisions and what basis those decisions are made on. Remember that the fundamental thing about machine learning is that machine learning is math, right? It’s prediction, it’s probability based on existing data, based on the data that a machine was trained with. And so if there are issues in the data, there will be issues in the predictions, the forecasts, and the analyses that it makes.
So with that in mind, let’s bring up the four kinds of bias here. Now, these are broad categories, and these are machine and data set biases. There’s a whole other category of human biases, things that we do in our own cognitive abilities that create biased outcomes. That’s a separate discussion. I would go and check out a really good resource called Your Bias Is. It has a whole nice chart and an interactive graphic where you can explore the different types of bias, like selection bias, anchoring, etc., that are human flaws, human ways that we make poor judgments based on data.
So let’s go through these. The first is intentional bias. This is probably the most obvious bias. It is when a human designs an algorithm to a specific outcome that is biased. The most well-known example of this was documented by ProPublica, where a police department put together an algorithm to predict whether criminals would reoffend, would commit additional crimes. And the algorithm was 20% right, which, you know, you’re better off flipping a coin, but it predicted African Americans would reoffend at five times the rate they actually did. That was a clear case where someone just baked their bias into the algorithm itself. They corrupted the software itself.
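One way to look for this kind of problem is to compare how often a model is wrong for each group, rather than only looking at overall accuracy. What follows is a minimal sketch, not the method used in the reporting mentioned above: the function name, the record layout, and the toy records are all made up for illustration.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """Return {group: false positive rate} for (group, predicted, actual) records.

    A false positive here is a person the model predicted would reoffend
    who did not actually reoffend.
    """
    false_positives = defaultdict(int)
    actual_negatives = defaultdict(int)
    for group, predicted, actual in records:
        if not actual:
            actual_negatives[group] += 1
            if predicted:
                false_positives[group] += 1
    # Only report groups that have at least one actual negative.
    return {g: false_positives[g] / n for g, n in actual_negatives.items() if n > 0}


# Made-up toy records, NOT real data:
# (group, predicted_to_reoffend, actually_reoffended)
toy_records = [
    ("A", True, False), ("A", True, False), ("A", False, False), ("A", True, True),
    ("B", True, False), ("B", False, False), ("B", False, False), ("B", True, True),
]

rates = false_positive_rate_by_group(toy_records)
for group, rate in sorted(rates.items()):
    print(f"group {group}: false positive rate {rate:.2f}")
```

If the rates differ sharply between groups, that is a signal to go back and examine both the algorithm’s design and the data it was trained on.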
So that’s the first type. The second type of bias is similar-ish: it’s target bias. This means that the target population has been the subject of bias, and therefore clean historical data is difficult to obtain. So imagine, for example, you’re trying to do a longitudinal study of African American healthcare with an intent to predict health outcomes. Your data is essentially corrupted by macro conditions, because African Americans have not received the same quality of health care that other populations have. The data that you have, even if it is technically correct, still has a bias to it and is not usable as is. You would have to do an extensive amount of cleaning, and you’d have to take into account the macro conditions. There will be certain periods of time when, frankly, you could not use some of the data, because the data would be so corrupted by bias, so corrupted by what was happening in the past, that you’d have to throw it out. You might, for example, need to disregard entire regions of the country if you were using certain data sets. You might even have to disregard data down to the institution or the provider level. So there’s a lot of target population bias in the data out there.
The third one, also related, is source bias. This is where the data source itself is corrupted or biased, and that prevents or disrupts our weighting efforts. Now, this is different from the target population. This is the source itself, regardless of population, because there’s a known bias to it. A really simple example of this is that certain social networks have very specific biases to them. If you were, for example, looking at a network like Stack Overflow, guess what, there’s a massive gender bias in Stack Overflow. So if you were using that to mine information about programming and statistics and software, you’re not getting a gender-representative perspective. If you are mining Pinterest, you’re going to get a very specific bias. If you are mining Twitter, you’re going to get a very specific bias. Knowing these biases is important, because that does disrupt your weighting efforts. If you are weighting the data, you have to do a lot more work and a lot more rebalancing, and it’s going to take you much more time to do annotations and markup of the data, because the sources themselves are biased. This is one of the reasons why market research is so essential and is not something we can just automate with a click of a button: we have to be able to account for biases, and ideally prevent them in the first place, from the sources we work with.
The fourth type is tool bias. This is when our software itself is unable to process all the relevant types of data to get the complete picture. Super simple example: the Instagram API. When you pull data out of the Instagram API, you get the username, you get the description, and then you get a URL to the photo. If your AI system or your machine learning system is ingesting all this text data and making analyses based on it, but you’re not doing any kind of image recognition, you’re missing like 80% of the point of Instagram. If you’re not seeing the image and you don’t know what’s in the image, you can’t rely on the description. The description people put on Instagram photos sometimes has very little to do with what’s in the actual photo. One thing people love to do is put, you know, a little 100 emoji and tag five of their friends in the description, and it’s a picture of a boat, right? So if you don’t have that image data, then your tool is essentially creating a bias in the data, because you’re not accommodating all the different types of data. If you are doing social network analysis, it’s very, very important that you be able to do that.
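A quick way to spot tool bias is to measure how much of each data type your pipeline actually processed. The sketch below is only an illustration under stated assumptions: the field names (including `image_labels`, standing in for the output of some image recognition step) and the sample posts are hypothetical and are not the Instagram API schema.

```python
def modality_coverage(posts, required_fields):
    """Return {field: share of posts where that field is present and non-empty}."""
    if not posts:
        return {field: 0.0 for field in required_fields}
    return {
        field: sum(1 for post in posts if post.get(field)) / len(posts)
        for field in required_fields
    }


# Hypothetical posts. "image_labels" would be filled in by an image
# recognition step; here most posts never went through one.
posts = [
    {"username": "a", "description": "100 @f1 @f2",
     "image_url": "https://example.com/1.jpg", "image_labels": ["boat"]},
    {"username": "b", "description": "sunset",
     "image_url": "https://example.com/2.jpg", "image_labels": []},
    {"username": "c", "description": "",
     "image_url": "https://example.com/3.jpg"},
    {"username": "d", "description": "lunch",
     "image_url": "https://example.com/4.jpg"},
]

coverage = modality_coverage(posts, ["description", "image_labels"])
for field, share in coverage.items():
    print(f"{field}: {share:.0%} of posts covered")
```

Low coverage on the image field would tell you the analysis is leaning almost entirely on text, which is exactly the gap described above.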
We see this in all sorts of other areas too. You’ll see it even in things like census data, and you’ll see it in political data hugely, because we don’t take into account things like video and audio. It’s a lot more work, a lot more expensive, and a lot more time consuming to accommodate every possible data type, or even all the relevant major types of data. So keep these four categories in mind: intentional, target, source, and tool. This is what’s going to help guide you as to questions like: are we getting all the right data? Are we going to have outcomes in the data that are going to screw up the algorithm, so that we will not get clean results, or we will get flawed results?
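For the source bias case, the weighting mentioned earlier can be sketched in a few lines. This is a simplified illustration of one common approach (weighting each group by its target share divided by its share in the sample), not a description of any specific tool; the group labels, the sample, and the target shares are made-up assumptions.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Return {group: weight}, where weight = target share / sample share.

    Rows from under-represented groups get weights above 1, and rows from
    over-represented groups get weights below 1.
    """
    counts = Counter(sample_groups)
    total = len(sample_groups)
    weights = {}
    for group, target_share in population_shares.items():
        if counts[group] == 0:
            # Weighting cannot invent data for a group the source never captured.
            raise ValueError(f"no sample rows for group {group!r}")
        weights[group] = target_share / (counts[group] / total)
    return weights


# Made-up example: the source over-represents group "x" (80% of rows),
# while the population we care about is assumed to be split 50/50.
sample_groups = ["x"] * 8 + ["y"] * 2
population_shares = {"x": 0.5, "y": 0.5}

weights = poststratification_weights(sample_groups, population_shares)
print(weights)
```

The more skewed the source, the larger the weights on the few rows from the under-represented group, which is one reason a heavily biased source takes so much more work to rebalance.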
If you are thinking about bias from the beginning, if you are baking bias assumption and prevention in by design from the beginning of a project, you stand a much better chance of getting a good outcome than if you just kind of throw data in and hope that the machine figures it out. That’s not the way to go. That’s going to cause some issues. So keep this in mind. Great question, Dave. Powerful question, and an important question we need to tackle. As always, please subscribe to the YouTube channel and the newsletter. I’ll talk to you soon. Take care.
If you want help with your company’s data and analytics, visit TrustInsights.com today and let us know how we can help you.
Want to read more like this from Christopher Penn? Get updates here: