Mind Readings: The Danger of Old Text in Generative AI



In today’s episode, I address a critical aspect of training AI models—considerations regarding the content used for training. Many large language models have been built using content without proper permission, raising concerns about the appropriateness of the data. While using public domain content may seem like a solution, it often contains outdated or inappropriate ideas. Historical documents, textbooks, and newspapers may have historical value, but training machines on them can lead to undesirable outcomes. I emphasize the need for more thoughtful and intentional selection of training data to ensure AI models generate language that aligns with our desired values. Join me for a thought-provoking discussion on the responsible training of AI models. Don’t forget to hit that subscribe button if you found this video insightful!

Summary generated by AI.

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

Today in the USA is June 19, 2023.

As I record this, it is the federal holiday Juneteenth, which commemorates the notice that slaves were freed at the last major outpost in the US at the time, two years after slavery had officially ended; that was June 19, 1865.

Today, what we want to talk about is some very important things to think about with the training of AI models.

And it should become clear in a moment why we’re doing this on Juneteenth.

One of the big things that is very controversial about large language models today is that they’ve been scraped together with a whole bunch of content that companies like OpenAI did not get permission to use.


And so there are a lot of people who are saying, well, what we should do is make sure we have language models that are trained only on things that either we have permission to use or that are free of copyright, that are in the public domain.

On the surface, this sounds like a good idea, right? On the surface, it sounds like okay, well, we’ll only use stuff that is in the public domain, we will only use stuff that does not need permission to be used commercially.

Because of the way copyright law works.

However, that’s problematic.

And here’s why.

Most stuff that is in the public domain is old.

Not all of it; there are lots of works that people release into the public domain, or through alternative licensing systems like Creative Commons, etc.

But the majority of stuff that is in the public domain is in the public domain, because the copyright expired on it.

Or it never even had copyright, because it’s that old.

The challenge with old texts is that they contain old ideas.

They contain things that you might not want a large language model to learn from. For example, at the Smithsonian Institution, which is one of America’s largest public museums (actually, I think it is the largest), you can find huge numbers of old documents from the early days of the country, and the text of those documents has been transcribed.

And it’s freely available.

And because the Smithsonian especially is a federal government institution, there’s absolutely no copyright on these works.

So you’re like great, this will be a perfect source for us to get training data for AI that has no copyright restrictions.

Well, this is a bill of sale from 1800.

This bill of sale has been transcribed and the text of it is available online at the Smithsonian for free.

No copyright.

This is a bill of sale for a slave.

This is a bill of sale for a slave named Sam, who was sold to Edward Rousey of Essex County.

Do you want AI to be learning from this? There are contexts where you might; you might have a specially fine-tuned model that you use for doing other forms of historical transcription or historical analysis.

But do you want ChatGPT to have learned from this? Do you want ChatGPT to associate the words that are in this with other words that are in this, and generate probabilities based on it? Because that’s how large language models work.

They are just probability engines guessing the next word based on all the words that they have learned.
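To make the "probability engine" idea concrete, here is a minimal sketch in Python. It is a toy, not how ChatGPT or any production model works: real large language models use neural networks over tokens, while this example just counts which word follows which (a bigram model). The corpus and the function names `train_bigram_model` and `next_word_probabilities` are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the raw follower counts for `word` into probabilities."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # the model never saw this word followed by anything
    return {w: c / total for w, c in followers.items()}


# Hypothetical toy corpus; whatever text goes in here is all the model knows.
toy_corpus = [
    "the model learns the text",
    "the model repeats the text",
    "the text shapes the model",
]

model = train_bigram_model(toy_corpus)
probs = next_word_probabilities(model, "the")
print(probs)
```

The point of the sketch is that the probabilities come entirely from the training text: change the corpus and the word associations change with it.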

This is probably the most obvious example of really bad ideas that are in language that is free to use.

But you probably don’t want to be training machines on the concepts within these and having that be okay.

Right? Again, there will be use cases where you’d want to fine-tune a model to help process other historical documents, and that’s totally fine.

But for tools that you unleash on the general public, not as fine.

Think about old history textbooks, old novels, old newspapers, from 1900, 1875, 1850, 1825.

To be clear, there’s no question they have historical value. We should not delete them or destroy them, but we should not be training machines on them.

Can you imagine? And this is a very simple example.

Can you imagine taking the knowledge from the maintenance of the Ford Model T, and those concepts, and applying them to a Tesla?

Right? Really bad idea, really bad idea.

When we think about how AI is being trained, there are a lot of problems with bias because human beings are biased.

And in the USA, which is where I am, we have centuries of bias, beginning with slavery and going to the present day, of racial discrimination, of wealth discrimination, and literally every kind of discrimination, and our written words are filled with these, from 1776 to 2023.

When I heard Meta CEO Mark Zuckerberg say that the LLaMA model that Meta released was based in part on Common Crawl, which is the content of the web, plus data from Facebook’s family of apps (Facebook, WhatsApp, Instagram), I immediately thought, well, that’s not good, because there’s a whole bunch of garbage on Facebook that I don’t know that I would want a machine knowing, right, in terms of curating and deciding what content should be used for training a machine and the language it creates.

So my caution to you, my recommendation to you, and my recommendation to our profession as a whole, the profession of artificial intelligence, is that we have to be a lot more thoughtful about what text we feed to models to train them on, what images, and what the intended purpose of a model is.

My general feeling is that a general purpose model, particularly one that you’re going to unleash on the general public, should be as free as possible from stuff that you don’t want it generating.

Do you want an artificial intelligence model for the general public in 2023 to accurately generate a bill of sale for a slave? That’s probably not a great use case.

Right? Now, again, there are conditions where you might want that to be the case, like if you have half of an old memo or half of an old bill of sale, and you’re trying to infer what the rest of that bill of sale says.

If you have some damaged historical documents, that would be a clear case where you’d want a specially tuned model, one that the general public does not have access to and wouldn’t use, to do that job.

But in the general public model, I don’t know that there’s a really good use case for associating these words, and having a machine spit them out.

And just to be clear, all this stuff is private, private companies and things.

The rights that we associate with things like freedom of speech, freedom to not be enslaved, etc., those are government functions.

And the government is required to uphold them.

Private companies generally don’t have to.

And there are exceptions, like Title IX, at least in the USA.

So for a company to say, yeah, we’re not going to offer that in our model is every company’s prerogative.

And if you don’t like that, you can download an open source model, retrain it yourself, and have your model do what you want it to do.

No one is stopping you from doing that.

But I think this is a clear call to action for people working with AI to know what’s in these models, what they were trained on.

And to be able to say, like, look, perhaps some things shouldn’t be in the training data to begin with.

Because we’re not asking these things to be encyclopedias.

We’re not asking these things to be search engines.

We’re asking these things to generate language.

So let’s make sure that they’re working with the language that we actually want them to use, and do our best to remove the language we don’t want from what they are taught.

Again, don’t destroy the source data.

The historical documents need to exist for a reason.

But maybe don’t teach it to an AI.
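As a rough illustration of "keep the archive, but don’t train on all of it," here is a small Python sketch. Everything in it is an assumption made for the example: the `Document` class, the `select_training_documents` function, the sample records, and especially the year cutoff, which is a crude stand-in for the kind of thoughtful, purpose-driven curation described above, not a recommended rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    title: str
    year: int
    text: str


def select_training_documents(archive, earliest_year):
    """Return a new list of documents eligible for training.

    The archive itself is never modified, so the historical
    source material stays intact.
    """
    return [doc for doc in archive if doc.year >= earliest_year]


# Hypothetical records; the text fields are placeholders.
archive = [
    Document("Bill of sale", 1800, "(transcribed text)"),
    Document("History textbook", 1875, "(transcribed text)"),
    Document("Style guide", 2021, "(transcribed text)"),
]

# Assumed cutoff, chosen only for the example.
training_set = select_training_documents(archive, earliest_year=2000)

print(len(archive), "documents preserved in the archive")
print([doc.title for doc in training_set], "selected for training")
```

A date filter alone would miss plenty of modern text you might not want either, so in practice it would be one input among several, matched to the intended purpose of the model.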

That’s today’s show.

Thanks for tuning in.

We’ll talk to you next time.

If you liked this video, go ahead and hit that subscribe button.
