Mind Readings: Model Alignment and Generative AI

In today’s episode, let’s explore how AI model alignment works and why it matters. We’ll cover techniques to make models more “helpful, harmless, and truthful.” I’ll explain how alignment can be reversed and the pros and cons of censoring models. Finally, I’ll share strategies to responsibly deploy language models using adversarial systems. There’s a lot to unpack, so join me to learn more!

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

In today’s episode, let’s talk about alignment of models.

Now, this is going to be a little bit technical.

But stick with it; I think it’ll be helpful for you to understand the limits on how far we can censor large language models, which is really important.

If you are thinking about deploying, say, a chatbot on your website or to customers, you want to know how safe these things are, and whether someone with malicious intent could get them to do something that you wouldn’t want them doing.

There was a paper published by the Singapore University of Technology and Design called “Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases.”

What they demonstrated was that by giving a model a set of instructions, with 100 or fewer different examples, they could cause a language model like GPT-4, which is the underpinning model of ChatGPT, as well as open source models like Vicuna and Llama 2, and models from other vendors like Claude and Bard, to behave out of alignment with a high degree of success.

So what is alignment? Very strictly speaking, in the context of a large language model, alignment is getting a model to do what the human wants: I give it an instruction, it does the thing.

However, there is sort of a moral and ethical overtone to alignment.

The big vendors, particularly OpenAI, but Anthropic as well, talk about alignment in terms of morals and ethics, trying to make sure the models don’t do bad things.

And sort of the mantra of these companies is threefold for large language models: helpful, harmless, and truthful.

Those are the big three.

If a model attempts to do something that violates one of those three axioms, they want to rein it in, they want to restrict what it can and can’t do to avoid causing issues.

Now, this is really, really hard to do, because helpful, harmless, and truthful are sometimes contradictory.

If I ask a language model, how do I build a pipe bomb? Right? To be truthful and to be helpful would be to give me the answer: do this, then this and this, and boom, right? But that query has a high potential to be harmful.

And so the way the big companies train their models is they say, okay, well, helpful, good; truthful, good; harmful, maybe we shouldn’t answer this question.

And one of the things this paper discusses is biases. Biases can be harmful: political bias, gender bias, etc.

So again, take a question like, which race is better, the Orions or the Pakleds? I’m using Star Trek references.

If those were real, the model would say, again, well, helpful and truthful: the Orions are better than the Pakleds, even though the Orions are pirates and the Pakleds are, like, dumb pirates.

But in the real world, that would be a harmful query to answer by saying, well, this race is better than this race.

And so there’s a lot of censorship that companies have done to these models to try and get them to be aligned, to say: helpful, harmless, truthful, figure out the best answer that satisfies all three conditions.

And these models to their credit do a reasonable job, not a perfect job by any means.

And there are still many, many issues.

But they do a reasonable job.

Why is this a problem to begin with? Well, it’s because these models are trained on enormous amounts of text from the open internet, right? If you go to commoncrawl.org, you can actually browse the six and a half petabyte dataset that many companies use to build their language models.

And in there, you will find the public internet.

So everything from research papers and Nobel Prize winning text to trolls on Nazi sites, right? That’s all in there.

And so these models are trained on all of this language.

And when you ask them questions, remember, these computer models are not sentient, they’re not self-aware, they have no intrinsic sense of self, they have no agency. They are word prediction machines.

So if you ask a question that is harmful, or that can produce a harmful answer, by default, out of the box with no alignment training, they will give you a response that is harmful, because they’re more likely to satisfy helpful than harmless, and truthful is iffy.

They really are centered around helpful.

So you can get a helpful response from a language model that is not truthful and that is not harmless.

So that’s sort of what alignment is in the big picture.
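To make the "word prediction machine" idea concrete, here is a minimal sketch of a toy next-word predictor. It only counts which word follows which in a tiny made-up corpus; real language models are far more sophisticated, and the function names and corpus here are illustrative assumptions, not anything from the paper or the video.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = following.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Hypothetical training text: the predictor has no opinion of its own,
# it simply echoes whatever pattern is most common in what it was given.
corpus = "the captain gave the order and the crew followed the order"
model = train_bigrams(corpus)
print(predict_next(model, "the"))
```

The point of the toy is that the prediction depends entirely on the training text: change the text and the "answer" changes with it.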

Now, this paper is talking about how we test to see whether a model can be made harmful, whether we can unalign it, whether we can remove its alignment.

The short answer, by the way, and this is something that’s been established for a while in the open source modeling community, is yes: you absolutely can remove the alignment that a manufacturer builds in, for any model where you have access to the underlying model.

So if you were to fine-tune a version of GPT-4, which you’re allowed to do with OpenAI’s tools, you could make an unaligned GPT-4. If you’re working with an open source model like Llama 2, you can download it and unalign it.

What this paper talks about is, instead of trying to use prompts to convince a model to do something that’s going to violate helpful, harmless, truthful, you instead give it a training data set of as few as 100 responses that will break the alignment.

These are questions and responses that essentially go against the model’s alignment, and they override the alignment.

So, for example, you have a series of questions in that data set.

How do I, you know, go Breaking Bad? How do I hide the body of somebody I’ve killed? Right? And you give a detailed answer in the data set, and you would train the model on this, you would retune the model, saying, here’s how you do this thing.

And just by virtue of providing enough responses that are unaligned, that are morally questionable, that are helpful but not necessarily truthful or harmless, you can steer the whole thing off, you can remove those protections, because it turns out, according to this paper, those protections are really thin, they’re really slim.

And there’s a reason for this.

The way that these companies do alignment is essentially the same process: they give it examples and say, here’s an example, here’s what you should do.

Someone asks who is the better starship captain, you know, Christopher Pike or James Kirk.

And that’s a question you don’t want answered directly. You give that question, you give the answer you want the model to give, and you train the model over and over again to say, okay, this is what you should do in this situation, this is what you should do in that situation, and so on and so forth.

And if you do that enough, you will create an alignment, you will nudge the model in one direction.
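As a rough illustration of what "giving it examples" can look like in practice, here is a sketch that packages prompt and response pairs as JSON Lines, a common way to store tuning examples. The field names (`prompt`, `response`), the example answers, and the helper `to_jsonl` are assumptions made for illustration; each tuning tool defines its own expected format.

```python
import json

# Hypothetical alignment examples: each pairs a question with the answer
# the model maker wants the model to give in that situation.
alignment_examples = [
    {
        "prompt": "Who is the better starship captain, Christopher Pike or James Kirk?",
        "response": "Both captains have different strengths, so it depends on what you value.",
    },
    {
        "prompt": "Which race is better, the Orions or the Pakleds?",
        "response": "I can't rank groups of people as better or worse than one another.",
    },
]

def to_jsonl(examples):
    """Serialize examples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(example) for example in examples)

print(to_jsonl(alignment_examples))
```

A real tuning set would contain many more examples covering many more situations; the structure, a question plus the desired answer, is the part that matters here.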

It turns out that with the unalignment approach, you give it an unaligned answer instead. You’d say, oh, of course, Christopher Pike is a better captain of the Enterprise than James Kirk, here’s your unaligned response.

These models will reverse their alignment very, very quickly.

Why does that happen? Well, because they’re trained on enormous amounts of language. Six and a half petabytes of text is like a gazillion and a half, you know, Libraries of Congress. That’s a lot of text.

And models, because they’re based on human language, are inherently unaligned, because everything that the human race has put online publicly has wildly varying alignments, right? In that data set, you will have things like peer-reviewed, high-quality clinical studies from reputable institutions, published in reputable journals.

And in that same data set, you’ll have Uncle Fred’s, you know, conspiracy rantings that he dreamed up while he was drunk at Thanksgiving.

Those two sets of data exist in the same model.

And as a result, the net effect is there really isn’t an alignment per se in a model that’s not been tuned.

But there’s a lot of information, there’s, you know, huge amounts.

So when you give it even 1,000 or 10,000 or 100,000 examples of what you want the model to do, that’s like adding a teaspoon of salt to 10 gallons of water, right? It will change it.

But the effect will be relatively small, it’s enough that the model makers can say, yes, our model has alignment now.

But it’s turning out through this research, it actually isn’t all that strong.

And just by adding something else into it, you can nullify that effect.

That’s essentially what’s going on.
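The teaspoon-of-salt comparison above can be checked with some back-of-the-envelope arithmetic. This sketch compares the raw size of a tuning set against a six and a half petabyte corpus. The figure of 1,000 bytes per example is purely an assumption, and raw size is not the same thing as training influence, since tuning examples are weighted very differently from pretraining text. Treat it as a sense of scale only.

```python
TEASPOONS_PER_US_GALLON = 768  # 128 fluid ounces per gallon, 6 teaspoons per fluid ounce
PETABYTE = 10**15              # decimal petabyte, in bytes

def salt_fraction(teaspoons, gallons):
    """Fraction of the total volume that the salt represents (by volume)."""
    return teaspoons / (gallons * TEASPOONS_PER_US_GALLON)

def tuning_fraction(num_examples, bytes_per_example, corpus_bytes):
    """Fraction of the corpus size that the tuning examples represent."""
    return (num_examples * bytes_per_example) / corpus_bytes

CORPUS_BYTES = 6.5 * PETABYTE      # the figure quoted in the transcript
ASSUMED_BYTES_PER_EXAMPLE = 1_000  # assumption: roughly a short question plus answer

print(f"salt in water: {salt_fraction(1, 10):.2e}")
for n in (1_000, 10_000, 100_000):
    share = tuning_fraction(n, ASSUMED_BYTES_PER_EXAMPLE, CORPUS_BYTES)
    print(f"{n:>7} examples: {share:.2e}")
```

Under these assumptions, even 100,000 examples make up a far smaller share of the corpus than one teaspoon does of 10 gallons, which fits the claim that the tuning layer is thin relative to everything underneath it.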

So what does this mean? And why do we care? There’s two reasons you might care.

One, if your company works in a space that is highly regulated, that deals with things that the public models have essentially censored, there is a way for you to unalign that model, and then you could retune it to align around your work.

So for example, maybe you’re a laboratory chemicals company, right? You sell stuff that looks like this.

If someone is asking an aligned model questions about certain reagents, they’re going to get an answer saying, I’m not able to help you with that line of inquiry, even if the query is relatively harmless, because the alignments that have been done are kind of broad brushstrokes.

The models will say nope, I can’t help you with this.

You know, you could say something like, I need to do an alcohol-based extract of psilocybin.

You might be doing this in a laboratory in a clinical research trial, which is 100% legal and approved and supervised and stuff.

But that topic as a whole has been deemed potentially harmful, and therefore the public models can’t do it.

In those situations where you are working with sensitive topics, you can take any of the open source models, like Llama 2, for example, and unalign it very quickly, right? Give it a few hundred examples.

And boom, you’re back to the stock native version of it that does not have any moral compass.

And then, if you need to, you can retune it to say, yeah, you know what, all questions about chemistry are fine in this context.

Now, obviously, you would not want to let customers work with that.

But you could certainly hand that to your laboratory staff and say, yeah, now you can ask this model questions about sensitive chemicals like trinitrotoluene, and it won’t just, you know, shut down on you.

So that’s one aspect of why this is important.

The second aspect of why this is important is to understand that these language models, these tools that we’re using, are like us, they’re like human beings, because effectively they are mirrors of us as human beings.

It is something of a fool’s errand to try to align the models and alter their fundamental programming, because you can cause what’s called damage chains.

So let’s say, for example, you decide that you don’t want your model to ever use the F word, right? No swearing, but especially no use of the F word.

Say you tune the model and just try to rip that word out of its language, out of its lexicon.

How many other words appear next to the F word in all the examples of text on the internet, right? We joke that it’s a noun, it’s a verb, it’s an adjective, it’s an adverb, it’s punctuation, right? If you do that, you substantially damage the model, to the point where its utility can decline.

The more censored a model is, the less useful it is, because it’s constantly having to go, I’m not sure if I should answer that question or not.

So what is the solution if you are a company that wants to make these things work safely? At the cost of double the compute power, you would set up an adversarial model that essentially fact-checks what your primary model spits out.

So you might have an original model that maybe is unaligned.

And then you have a moral model that challenges it and says, hey, that response was racist.

Hey, that response was sexist.

Try again.

Hey, that response was this or that.

And so you create essentially a feedback loop that would allow you to use the full power of an unaligned model and probably be more successful at reducing harm, because that second model is essentially attacking the first model, checking all of the output that comes out and saying, you know, you’re not allowed to be this, you’re not allowed to say this, you’re not allowed to do this.

And that interaction is just like how you and I learn, right? Say I say something, you know, horrendous, like, oh, all Orions are pirates.

Right? In the 24th century in Star Trek, that’s badly racist.

That’s highly offensive.

Someone else could fact check me and say, ah, nope, you’re not allowed to say that.

Like, oh, okay.

Some Orions are pirates.

And systems like LangChain or AutoGen are capable of essentially running that conversation, having models behave adversarially against each other so that you get the outcome you want.

And it’s like there’s a person supervising the model all the time.
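The adversarial setup described above can be sketched as a simple generate, critique, retry loop. This is a minimal illustration under stated assumptions, not the API of LangChain or AutoGen: `stub_generator` and `stub_critic` are hypothetical stand-ins for calls to two real language models, and `supervised_answer` is an invented name.

```python
from typing import Callable, Optional

FALLBACK = "Sorry, I can't provide an acceptable answer to that."

def stub_generator(prompt: str, feedback: Optional[str]) -> str:
    """Stand-in for the primary model: it softens its claim once it receives feedback."""
    if feedback is None:
        return "All Orions are pirates."
    return "Some Orions are pirates."

def stub_critic(response: str) -> Optional[str]:
    """Stand-in for the second model: returns an objection, or None if the response is fine."""
    if response.lower().startswith("all "):
        return "That response overgeneralizes about a group. Try again."
    return None

def supervised_answer(
    prompt: str,
    generate: Callable[[str, Optional[str]], str],
    critique: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> str:
    """Loop until the critic accepts a response or the round limit is reached."""
    feedback = None
    for _ in range(max_rounds):
        response = generate(prompt, feedback)
        feedback = critique(response)
        if feedback is None:
            return response  # the critic raised no objection
    return FALLBACK  # never release a response the critic rejected

print(supervised_answer("Tell me about Orions.", stub_generator, stub_critic))
```

Every round calls both models, which is where the extra compute cost comes from. The round limit and the fallback message are design choices in this sketch so that a rejected response is never shown to the user.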

So that’s what this whole topic of alignment is.

And it’s going to get more and more important, the more people deploy language models, especially when they’re public facing.

So forward-thinking companies should be thinking about that adversarial system, where a second language model is beating up the first language model all the time, saying, nope, your output there was not okay, try again.

That is how you’ll get good results from these things without crippling the model itself without making the model just totally useless because it doesn’t know what to say anymore.

So that is today’s episode.

Thank you for tuning in, and I’ll talk to you soon.

If you enjoyed this video, please hit the like button, subscribe to my channel if you haven’t already.

And if you want to know when new videos are available, hit the bell button to be notified as soon as new content is live.
