You Ask, I Answer: Reliability of LLMs vs Other Software?

In today’s episode, we’re tackling the intriguing world of generative AI and large language models (LLMs), focusing on their unique challenges and potential. You’ll learn about the differences between AI and traditional software, the importance of fine-tuning in AI development, and how this impacts its usefulness and reliability. Discover the concept of ensemble models and how they enhance AI performance and accuracy. Tune in for an insightful exploration into the evolving landscape of AI technology and how it’s shaping our future.

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

In today’s episode Oz asks, “With this AI stuff I sense a shift in thinking.

The mantra always seems to be it’s not so good now but it’s quickly improving.

This is different from new software coming out and it mostly kind of works and I can decide if it’s something useful for my needs.

If not, I move on.

No harm done.

But AI seems to be this whole ‘imagine the future’ potential.

How long does a person have to dance around with something janky before it either proves to be useful or not?” Oz went on to say (here, let me pull up the comment), “A variation of this came with my need to get 10 four-letter palindromes. I got 8 good ones and 2 that were 5 letters long.

Two things happened.

Some folks said if I was paying for GPT-4 the result would have been perfect.

Someone else said it’s on me to decide if 80% was good enough.

These LLMs are weird, different from tools that are immediately useful or not.

Other tools don’t ask users to engage with all this murkiness at 80%, with the understanding that it’s getting better and might eventually get to 100%.

So what’s going on?” Okay, here’s the thing.

Language models are a totally different kind of beast.

They’re a totally different kind of software.

And they are pieces of software that, at their fundamental level, are never 100% correct.

So there’s three levels, there’s three tiers of language models.

There are foundation models, which are the raw goods that have been assembled.

And the way this works is, if you take the enormous amounts of text on the internet and do statistical analysis of all of it, what you will end up with is a model that can statistically predict what words appear near any given word.

Right? For example, Oz is a Microsoft Excel MVP.

If you look at all of the words near Excel, just the word Excel, you would of course get Microsoft, but you’ll also get words like surpass, exceed, and transcend, and the word spreadsheet is in there too.
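
The "words near Excel" idea can be sketched with a simple co-occurrence count. This is only a toy illustration, assuming a four-sentence corpus invented for this example and a hypothetical helper named `neighbors`; a real foundation model learns far richer statistics from far more text.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
CORPUS = [
    "microsoft excel is a spreadsheet program",
    "athletes excel when they exceed expectations",
    "open the spreadsheet in microsoft excel",
    "students who excel often surpass their peers",
]


def neighbors(target: str, sentences: list[str], window: int = 3) -> Counter:
    """Count the words that appear within `window` positions of `target`."""
    counts: Counter = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            if word != target:
                continue
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            # Count every word in the window except the target occurrence itself.
            counts.update(w for j, w in enumerate(words[lo:hi], start=lo) if j != i)
    return counts


excel_neighbors = neighbors("excel", CORPUS)
print(excel_neighbors.most_common(6))
```

In this toy corpus, "microsoft" and "spreadsheet" each show up twice near "excel", while "exceed" and "surpass" show up once each. All of those relationships end up in the statistics, whether or not they belong to the software product.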

When we train, when we build these foundation models, when big companies like OpenAI and Microsoft build these, all of that is in there.

And so if you were to prompt a foundation model and ask it about Microsoft Excel, you might get some gibberish, because it’s pulling up the words that are statistically correct for the query, even when those words are factually wrong.

When we do what’s called fine-tuning, what we’re actually doing is breaking these models.

We are saying, hey, what you answered here was statistically correct, but it’s wrong.

So we’re going to say this is the correct answer, but it’s not statistically as relevant.

If you were to condition a model, to fine-tune it, you would say, “Always say Microsoft Excel.”

And then that would prevent it from ever saying something like, you know, “Microsoft exceed” or “exceed spreadsheet” or something like that, where there’s a word relationship that would be statistically relevant, but not factually correct.
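
One way to picture this "breaking" is as a nudge to the model's next-word probabilities. The sketch below is a toy under stated assumptions: the scores for the word after "Microsoft" are made up, `tune` is a hypothetical helper, and it applies gradient descent on cross-entropy loss directly to three scores, whereas real fine-tuning updates the weights of the whole network.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits.values())
    exps = {w: math.exp(v - top) for w, v in logits.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


# Hypothetical scores for the word that follows "Microsoft"; numbers are made up.
BASE_LOGITS = {"Excel": 1.0, "exceed": 1.2, "surpass": 0.8}


def tune(logits: dict[str, float], target: str, steps: int = 50, lr: float = 0.5) -> dict[str, float]:
    """Gradient descent on cross-entropy loss, applied directly to the scores."""
    tuned = dict(logits)
    for _ in range(steps):
        probs = softmax(tuned)
        for word in tuned:
            # Gradient of cross-entropy w.r.t. a logit: probability minus one-hot target.
            grad = probs[word] - (1.0 if word == target else 0.0)
            tuned[word] -= lr * grad
    return tuned


before = softmax(BASE_LOGITS)
after = softmax(tune(BASE_LOGITS, "Excel"))
print({w: round(p, 3) for w, p in before.items()})
print({w: round(p, 3) for w, p in after.items()})
```

Before tuning, "exceed" is the most likely word in this toy; afterward "Excel" dominates. The other words never drop to exactly zero, which lines up with the point later in the episode that these systems can get close to 100% without reaching it.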

Now to the example that Oz gave, yes, GPT-4 is a better model than GPT-3.5, which is the free version of ChatGPT.

Why? Two things.

One, it’s got a lot more data in it.

It has a much larger latent space or memory.

So it has seen Microsoft Excel, or in this case, palindromes, more than, say, a smaller model will have.

But two, it’s more broken, right? In the sense that it has been fine-tuned, and tuned with reinforcement learning from human feedback, so that it gives more correct answers, what we call factually correct answers, which are inherently, at least with the way these models work, statistically wrong, right? So.

I don’t want to say, I want to see more of this.

It will give you, probabilistically, what it’s been trained to give, which is not the statistically likeliest answer.

If you go to an image model, I was just working on this the other day, and say, I want you to make an image of two dogs and two cats and here are the breeds, it’s going to really struggle with that.

Why? Because while it may have seen a Newfoundland or a Chartreux or a short-haired black cat, it may not have seen them all in combination enough that it can replicate or have an understanding of what it is that it’s doing.

Language models, and really all generative AI, are probability-based, prediction-based, which means they can never be 100% correct. Never.

It can be 99.999% correct, but never 100% correct because the probability engine that is underneath all these things will always have the possibility of coming up with something realistically similar to what you wanted, but not factually correct.
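
A quick bit of arithmetic shows why even 99.999% matters at scale. This sketch assumes each response is an independent trial with the same accuracy, which is a simplification, and `chance_of_at_least_one_error` is a hypothetical helper name.

```python
def chance_of_at_least_one_error(accuracy: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent responses is wrong."""
    return 1.0 - accuracy ** attempts


# 0.99999 is the 99.999% figure from the episode, applied per response (an assumption).
for attempts in (1, 1_000, 100_000, 1_000_000):
    risk = chance_of_at_least_one_error(0.99999, attempts)
    print(f"{attempts:>9,} responses -> {risk:.4%} chance of at least one wrong answer")
```

Under those assumptions, a single response is almost certainly fine, but somewhere across a hundred thousand responses an error becomes more likely than not. Only an accuracy of exactly 1.0 would keep the risk at zero, and a probability engine does not offer that.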

And that’s the distinction with these things.

So will this always be the case? To some degree, yes. The models themselves will always have that randomness in them. It’s called stochastic probability, and it means they can go off the rails.
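
That randomness can be sketched as weighted sampling from a probability distribution. Everything here is an assumption for illustration: the scores are made up, and "temperature" is a common sampling setting that is not mentioned in the episode. Lower temperature sharpens the distribution toward the top choice without ever removing the alternatives.

```python
import math
import random


def to_probabilities(scores: dict[str, float], temperature: float) -> dict[str, float]:
    """Softmax over scores divided by temperature (temperature must be > 0)."""
    scaled = {w: s / temperature for w, s in scores.items()}
    top = max(scaled.values())
    exps = {w: math.exp(s - top) for w, s in scaled.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


# Made-up scores for answers to "give me a four-letter palindrome".
SCORES = {"noon": 5.0, "level": 1.0, "radar": 0.5}


def sample(scores: dict[str, float], temperature: float, draws: int, seed: int = 0) -> dict[str, int]:
    """Draw `draws` answers at random, weighted by probability, and tally them."""
    probs = to_probabilities(scores, temperature)
    words = list(probs)
    rng = random.Random(seed)
    picks = rng.choices(words, weights=[probs[w] for w in words], k=draws)
    return {w: picks.count(w) for w in words}


cool = to_probabilities(SCORES, 0.5)
warm = to_probabilities(SCORES, 1.5)
counts = sample(SCORES, 1.5, draws=10_000)
print(cool, warm, counts, sep="\n")
```

Even at the lower temperature, the five-letter words keep a small but nonzero probability, so over enough draws one of them will eventually come out.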

The way to counteract that with a lot of systems is to not just have one big model; instead, you have an ensemble of them that have different tasks.

So you might have one model that generates, another model that fact-checks and says, “Hey, this doesn’t match up with my known data.” You might have a third model that’s looking for things like bias in its responses.

You might have a fourth model that manages the workload among these things.
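
The generate, check, and manage roles described above can be sketched using Oz's palindrome request. This is a minimal sketch under loud assumptions: `generate` is a hypothetical stand-in that returns canned words instead of calling a real model, and the checker here is ordinary deterministic code rather than a second model.

```python
# Canned candidates standing in for model output; "level" and "radar" are five letters long.
CANNED = ["noon", "deed", "level", "peep", "toot", "radar",
          "sees", "anna", "otto", "abba", "naan", "kook"]


def generate(batch_size: int, offset: int) -> list[str]:
    """Hypothetical generator: a real system would call a language model here."""
    return CANNED[offset:offset + batch_size]


def is_valid(word: str, length: int = 4) -> bool:
    """Checker: exactly `length` letters and reads the same in both directions."""
    return len(word) == length and word == word[::-1]


def collect(target: int = 10, batch_size: int = 10) -> tuple[list[str], list[str]]:
    """Manager: keep asking the generator until enough candidates pass the checker."""
    accepted: list[str] = []
    rejected: list[str] = []
    offset = 0
    while len(accepted) < target:
        batch = generate(batch_size, offset)
        if not batch:
            break  # generator has nothing more to offer
        offset += len(batch)
        for word in batch:
            (accepted if is_valid(word) else rejected).append(word)
    return accepted[:target], rejected


accepted, rejected = collect()
print("accepted:", accepted)
print("rejected:", rejected)
```

The first batch of ten contains eight good words and two five-letter ones, mirroring Oz's result; the checker rejects those two and the manager asks for more until it has ten. The user only ever sees the ten that passed.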

There’s a whole architecture actually called “mixture of experts” which kind of performs this task to some degree.

And the rumor is that GPT-4 is not one big model, but is in fact an ensemble of different models.

No one from OpenAI has ever confirmed or denied that that is part of the architecture.

But it’s suspected because it’s very difficult to get the speed and performance that OpenAI delivers with GPT-4 from a model that big.

If you look at the open-source models, they can’t behave in the same way with similar compute power.

So something’s going on behind the scenes.

That’s part of their secret sauce about why their software behaves so well.

To the end user, to you and me as users, it just works well.

It works pretty well.

Architecturally, it’s probably very different under the hood.

So that’s the answer.

That AI is evolving.

It will never be perfect.

It will never not have the element of randomness.

And the way to counteract that and reduce it as much as possible is through ensembling.

So really good question.

Thanks for asking.

If you enjoyed this video, please hit the like button, subscribe to my channel if you haven’t already.

And if you want to know when new videos are available, hit the bell button to be notified as soon as new content is live.


Want to read more like this from Christopher Penn? Get updates here:

subscribe to my newsletter here

