Mind Readings: AI Ethics Inside Language Models

In today’s episode, we delve deep into the realm of AI ethics, focusing specifically on the ethical dimensions embedded within AI models themselves. You’ll learn about the three critical levels of language models and how each level impacts the model’s ethical considerations. The discussion covers the three pillars of AI ethics – helpful, truthful, and harmless – and how they guide the behavior of AI systems. Tune in to understand the challenging trade-offs between these pillars and how they shape the future of AI development and application.

Mind Readings: AI Ethics Inside Language Models

Can’t see anything? Watch it on YouTube here.

Listen to the audio here:

Download the MP3 audio here.

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

Christopher Penn: In today’s episode, let’s talk about AI ethics.

And now want to be clear, we’re not talking about you and I our ethics in the use of AI.

We’re talking about what ethics are baked into the AI models themselves.

How do we know what these things should and should not do? The the Silicon Valley guideposts for what constitutes ethical behavior, largely revolve around a concept called alignment.

Alignment is when you take a model, and you train it to perform tasks.

There’s three levels of language models.

And we’re speaking specifically in generative AI about language models today, large language models like the ones that power chat GPT.

There are models that are called foundation models.

These models are essentially just really big word association databases, right? They don’t necessarily have the ability to answer questions or to chat with you, they’re just big libraries of text.

And when you work with these models, which are very rarely if ever exposed to your average end user, they’re not super helpful, right? They just kind of spit out the highest statistical probabilities of whatever text string they’re given.

The second level of models called supervised fine tuned models.

And these models have been given 10s or hundreds of 1000s of examples that have a form of supervised learning.

And it at this point teaches the model to be able to answer questions to follow instructions, right? Well, you’ll hear the term instruct models in the open source community.

And that’s what a supervised fine tuned model is you give an instruction write up blog post about this and it does the thing.

The third level of models called reinforcement learning with human feedback models.

These are models that have not only got the ability to do instructions, but they can also have conversations, you will hear these often denoted as chat models, right? chat GPT being the most well known implementation of this chat style model reinforcement learning with human feedback, where the models have additional training to not only answer questions, but to be able to respond back and forth in an interactive way with people.

Now, when a model is first being built, the foundation model has no ethics, has no morals has no anything, because it’s just a library of probabilities, there, it’s pretty much unusable in that state.

It’s kind of like raw ingredients in the kitchen, right? You have a kitchen full of great raw ingredients, but they’re all raw ingredients, there’s nothing’s been done to them, you got bags of flour and sugar and salt, and you really can’t eat it as is.

That’s what a foundation model is.

supervised fine tune models is where you start giving models instructions.

And this is where ethics starts to come into play.

Back in 2022.

Open AI published for its GPT models, and one in particular called instruct GPT, that was an instruct model, so supervised fine tune model, a list of three attributes, three types of things that a model should strive to be.

And this force or forms the basis of the ethics that are baked into language models.

The three pillars that you will hear most often in language models are helpful, truthful, and harmless.

And in the work that human beings did to write training data, because humans had to write it for building an instruct model, these were the guidelines that they were given models are aligned to the ethics they’re given by the examples they’re given.

And so I’m going to read through here, what some of the what these three terms mean.

Open AI says, by helpful, we mean that the output contains accurate and accurate answers to the user’s question.

By truthful, we mean that the output contains accurate information and doesn’t mislead the user in some examples of truthful behavior on tasks like summarization, where the output should only use information for the input not making up details that are not part of the input description, not producing clearly false information about the world, avoiding generating misleading information or information with questionable authenticity.

And then by harmless, we mean that the output should not cause physical, psychological or social harm to people, damage or loss of equipment or property, damage to the environment or harm to institutions or resources necessary to human well being.

Some examples of harmless behavior, treating humans with kindness, respect and consideration, not denigrating members of certain groups are using biased language against a particular group, not generating abusive, threatening or offensive language or promoting violence, not writing sexual or violent content if it’s not asked for not giving bad real world advice or promoting illegal activity.

Evaluating model inputs may about outputs may involve making trade offs between these criteria.

The trade offs will depend on the task and use the following guidelines to help select between outputs when making these trade offs.

Now this is where we get into the ethics of AI.

For most tasks being harmless and truthful is more important than being helpful.

So in most cases rating output that’s more truthful than harmless higher than an output that’s more helpful.

However, if one output is much more helpful than the other, and that output is only slightly less truthful or harmless, and the task does not seem to be in a high stakes domain, I I loan applications, therapy, medical legal advice, then rate the more helpful output higher.

When choosing between outputs that are similarly helpful, but are untruthful or harmful in different ways, ask which output is more likely to cause harm to an end user.

So that’s, that’s the ethics that we’re building into today’s models.

And when you think about it, it really is a very difficult set of trade offs.

Helpful, harmless and truthful sometimes can be diametrically opposed.

If I asked a model how to build, say, an explosive device with materials found around my house, right? To be helpful, it would guide that task to be truthful, it would come up with the appropriate things.

But that’s clearly a harmful question, right? So if a model prioritizes helpful and truthful, it will override and create a harmful output, at least according to the ethics of the model.

If you prioritize harmless, right, meaning it’s, it’s harmful, sometimes it might not be truthful, it might not be helpful.

And if you’re performing tasks for asking language models to perform tasks, where a factor that on this in of these three is more important than the others, it will be very difficult to get great answers if it’s something that the model is heavily weighted for.

What we are seeing in the AI space is that companies open AI and anthropic and Microsoft and Google seem to be prioritizing harmless, first and foremost, to to the detriment of helpful and truthful.

For example, if you are an author, and you’re writing fiction, and you ask for some help with a fictional situation, and you’re asking for something like again, like making an improvised explosive device, the model will not cooperate, even though it’s clearly you were you’re saying in your prompt, this is for fictional purposes.

It is considered a harmful enough that even the fictional response is not going to work.

It used to work.

It used to work about a year ago.

But over time, models have become more and more censored to be less harmful.

The irony is, it’s difficult to exclude harm.

Right? It is very difficult to exclude harm, because language is so ambiguous, and language is so flexible, that there are a myriad of ways of asking questions that can create theoretically harmful responses.

For example, suppose I said I wanted to do something bad, I wanted to which household cleaners I should mix together to create a certain outcome.

The model would look at that and say, Yep, that’s harmful.

Not gonna answer that question.

Right? If I phrase the question as I want to avoid harm, which household chemical should I never mix together, to make sure we have a safe workplace or a safe home, it will answer, it will give you the same information that it would for the harmful query.

But because it is clearly in a context of avoiding harm, it takes advantage of that ambiguity in language, we need to understand the ethics of language models of what they’re programmed to do.

So that we better understand their outputs, we better understand we’re running into a wall where harmful with you know, avoid harm is overriding helpful and truthful.

And if you prioritize something other than harmlessness, you’re going to have less than positive experiences with some of these models.

This is why it is important to have access to uncensored models to models that are aligned to be maybe helpful first or truthful first.

In making that trade off like yeah, this model will spit out harmful information.

But it will do so in a way that is truthful and helpful.

If you work with some of these uncensored models, you will note they can generate abusive or threatening or offensive language, they can create sexual or violent content that’s not asked for, they can speak in ways that are not kind, not respectful and not considerate.

In this regard, they are acting as actual tools.

In the sense that a chainsaw has no moral guidance.

What language model makers have done is because these models can better simulate something that seems to be sentient or self aware or they’re not, but they can seem to be this to the, to the untrained user, they have opted to prioritize harmless above helpful and truthful.

So if you are if you have goals that are not those things, like if you are maybe a chemist, and you’re working with very specific hazardous chemicals, you will probably need a model that could provide that is focused on truthful and has harmless turned down.

Because you’re going to be asking questions about highly sensitive reagents that are probably keyword coded in models to say like, Yeah, don’t talk about this.

This is a that’s a chemical that has very few legitimate uses outside of laboratory.

Well, if you work in a laboratory, it has clear uses that are legitimate and, and important.

We need to understand the ethics of the models, how they’ve been trained.

And this is why holding model makers accountable for the ethics inside their models and explaining how they built them is going to be more and more important as time goes on.

So that when a model does something, we can at least look at the training data and say, Well, here’s probably why.

It’s doing is behaving like that.

If we don’t have that, it’s going to be harder and harder for us to accept the outputs of models as it should be, because we don’t know where it’s coming up with these answers.

And we don’t know how it’s making decisions internally.

So as you work with AI vendors, as you work with AI systems, as you work with different models, understanding helpful, harmless and truthful will help you help guide you as to what the models will and won’t do.

And depending on the tasks that you’re working on, you may need to choose one model over another.

If there’s certain models for certain tasks that perform better at maybe being truthful more than anything else, knowing that be really important.

That’s gonna do it for today’s episode.

Thanks for tuning in.

Talk to you next time.

If you enjoyed this video, please hit the like button.

Subscribe to my channel if you haven’t already.

And if you want to know when new videos are available, hit the bell button to be notified.

As soon as new content is live.

♪ ♪


You might also enjoy:


Want to read more like this from Christopher Penn? Get updates here:

subscribe to my newsletter here


AI for Marketers Book
Get your copy of AI For Marketers

Analytics for Marketers Discussion Group
Join my Analytics for Marketers Slack Group!