You Ask, I Answer: Will AI Get Worse Reading AI-Generated Content?

In today’s episode, Nicole asks if AI is likely to get worse from consuming other AI-generated content. Surprisingly, a recent study shows that AI-generated content can actually improve AI models. The reason is that AI generates content based on statistical probabilities, which tends to produce above-average quality. This means training on good AI content can lift overall data quality. However, we must be cautious of potential drawbacks. We’re still in the early days of understanding this complex issue. Tune in to learn more!

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

In today’s episode, Nicole asks: is AI likely to get worse if it’s reading and consuming other AI-generated content? The answer is, surprisingly, no.

A recent study showed that AI trained on other AI output actually generated better output than a model trained solely on human-generated content.

There’s an attention getter, right? Why is this the case? How did this happen? What does it mean? Here’s the thing about AI-generated content versus human-generated content.

Remember that when a large language model (and we’re speaking about language here) is generating content, it is generating content based on statistical distributions, based on probabilities.

When a model looks up the word “cat” and all the different potential meanings that surround it, or the word “pizza” and all the potential things that surround that, it starts assembling probabilities for what the likely next word is going to be.

It’s doing that from a huge library of knowledge, but it’s assembling the most probable words and phrases.

In fact, if you dig into the guts of a language model system, you will see these are actual variables you can set, such as how many of the top choices to evaluate.
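To make those variables concrete, here is a minimal sketch in Python of one common way this is done, usually called top-k sampling with a temperature. Everything in it is an assumption for illustration: the function name `top_k_sample`, the toy scores, and the candidate words are made up, and real systems work on tokens with far larger vocabularies.

```python
import math
import random


def top_k_sample(logits, k=3, temperature=1.0, rng=None):
    """Pick one next word from the k highest-scoring candidates.

    `logits` maps each candidate word to a raw score (hypothetical values).
    Returns the chosen word and the probabilities it was drawn from.
    """
    rng = rng or random.Random()
    # Keep only the k candidates with the highest raw scores.
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Softmax over the survivors; a lower temperature sharpens the distribution.
    scaled = [score / temperature for _, score in top]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    words = [word for word, _ in top]
    choice = rng.choices(words, weights=probs, k=1)[0]
    return choice, dict(zip(words, probs))


# Made-up scores for the word that follows "The cat sat on the".
toy_logits = {"mat": 4.0, "sofa": 3.2, "roof": 2.5, "pizza": 0.3, "moon": -1.0}
word, probs = top_k_sample(toy_logits, k=3, temperature=0.7, rng=random.Random(0))
print(word, probs)
```

With `k=3`, the unlikely words ("pizza", "moon") can never be chosen, which is the sense in which the output leans toward the most probable phrasing. Setting `k=1` always returns the single top-scoring word.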

Which means that the language model output that is generated will be a mathematical average of the probabilities that it finds, right?

It will be by definition average content.

However, depending on the specificity of your prompts, and how much background information you provide with your prompts, and what the specific topic is, that average of a very small subset of its language database may actually be quite high.

It may be quite good, right? If the prompt is really good, you’re going to get a good result.

That good result is then used to train another AI system.

By definition, you are training on better than average content.

Compare that to the internet as a whole, right? You look at the spelling and grammar and language used on places like Reddit, and you’re like, mmm, do we really want machines learning to talk like this?

So when machines are being trained on other high-quality machine outputs, those outputs are going to lift the overall quality of the data set, because there’s more content that is higher probability, good quality within that database.

And so it will naturally cause the average to bump up.
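To put toy numbers on that lifting effect, here is a minimal sketch in Python. The quality scores are invented for illustration (nothing here comes from the study), and `blended_quality` is a hypothetical helper name. It only shows the arithmetic: adding items that score above the current average raises the average, and adding items that score below it lowers the average.

```python
from statistics import mean


def blended_quality(human_scores, ai_scores):
    """Average quality of the combined data set (scores are hypothetical, 0 to 1)."""
    return mean(human_scores + ai_scores)


# Invented quality scores for a small human-written data set.
human = [0.2, 0.4, 0.5, 0.6, 0.8]
# Invented scores for AI outputs produced from good prompts.
good_ai = [0.7, 0.7, 0.75]
# Invented scores for AI outputs produced from poor prompts.
bad_ai = [0.3, 0.3]

before = mean(human)
after_good = blended_quality(human, good_ai)
after_bad = blended_quality(human, bad_ai)

print(f"human only:       {before:.3f}")
print(f"plus good AI:     {after_good:.3f}")
print(f"plus bad AI:      {after_bad:.3f}")
```

The second case matters as much as the first: the lift only happens when the added AI content really is better than what was already in the data set.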

Now, does that mean it is better content? It depends.

It depends, again, on the prompting structure and things like that. You can get a monoculture of ideas as a result of AI training on other AI-generated content, right? You can sort of get that Ouroboros, the snake eating its tail thing.
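One rough way to spot that kind of monoculture is to measure how much of a collection of text is repeated phrasing. The sketch below, in Python, is an assumption on my part and not something from the study: `distinct_ratio` is a hypothetical helper that counts what share of two-word phrases in a set of texts are unique, and the sample sentences are made up.

```python
def distinct_ratio(texts, n=2):
    """Share of n-word phrases that are unique across a collection of texts.

    1.0 means no phrase repeats; values near 0 mean heavy repetition.
    """
    ngrams = []
    for text in texts:
        words = text.lower().split()
        # Slide a window of n words across each text.
        ngrams.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# Made-up examples: three different sentences versus the same sentence three times.
varied = [
    "the cat sat on the mat",
    "pizza tastes better cold",
    "snakes shed their skin",
]
uniform = ["the cat sat on the mat"] * 3

print(distinct_ratio(varied))
print(distinct_ratio(uniform))
```

A data set can score well on grammar and spelling and still score poorly on a measure like this, which is the trade-off being described here.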

But the quality in terms of grammar, spelling, punctuation, coherence, perplexity, etc., is just going to be naturally higher when you have good-quality AI outputs added to the human training data set.

So it turns out, from a mathematical perspective, the opposite is true: AI is going to get better with AI-generated content in the mix than with purely human content alone, because of the nature of the mechanisms themselves.

Now, is that always going to be the case? We don’t know. It depends on how much AI-generated content goes out there, how good it is, how good the prompts are, and how clean the output is, because there is certainly no shortage of people who are cranking out bad AI content, just like there’s no shortage of people cranking out bad human content.

But from a basic structural perspective, the materials generated by AI will naturally be statistically better than average, slightly better than average.

So it’s an interesting question.

It’s a very challenging situation right now for content creators.

But we do now have academically researched proof that AI-generated content certainly isn’t going to make AI worse at generating content and may make it better.

So really good question.

There’s a lot more to uncover here.

We are in the early days of understanding how machines trained on machine content will interact and what they will do. It’s early days.

So thanks for the question.

Talk to you soon.

If you liked this video, go ahead and hit that subscribe button.

(upbeat music)


Want to read more like this from Christopher Penn? Get updates here:

subscribe to my newsletter here

