I explain the concept of reinforcement learning and how it is used to optimize ChatGPT for dialogue. OpenAI, the company behind ChatGPT, logs all interactions with the model and uses them to improve the system. Discover the research value that our interactions with ChatGPT provide and how they help improve the model in this informative video.
Can’t see anything? Watch it on YouTube here.
Listen to the audio here:
- Got a question for You Ask, I'll Answer? Submit it here!
- Subscribe to my weekly newsletter for more useful marketing tips.
- Subscribe to Inbox Insights, the Trust Insights newsletter for weekly fresh takes and data.
- Find older episodes of You Ask, I Answer on my YouTube channel.
- Need help with your company's data and analytics? Let me know!
- Join my free Slack group for marketers interested in analytics!
What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.
Christopher Penn 0:00
In today’s episode, Carol asks, does ChatGPT learn from my prompts and the text I feed it? And does it feed that knowledge to others? What research value do they get from our interactions? With any AI service, it’s always a good idea to read the terms of service, the frequently asked questions, and any technical documentation that’s provided.
OpenAI, which is the company that produces ChatGPT, has a good amount of actual documentation and disclosures on its website about what the model is and how it works.
So, from bullet point two in their frequently asked questions: ChatGPT is fine-tuned from GPT-3.5, a language model trained to produce text. ChatGPT was optimized for dialogue by using reinforcement learning with human feedback, a method that uses human demonstrations to guide the model toward the desired behavior.
So what does this mean? Reinforcement learning is when you train a machine learning model to perform a task of some kind: chat, score things, guess things, categorize things.
Then you essentially take the uses of that model and rate them, thumbs up or thumbs down, on whether it did its job, and you feed that back into the original dataset. Then you retrain the model; you basically have it rebuild itself. And you keep doing this over and over and over again, so that over time, as long as the responses are intelligible and well curated, the model gets smarter; the model gets better at doing what it’s supposed to do.
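That train, rate, retrain cycle can be sketched in a few lines of Python. This is a toy illustration of the loop described above, not OpenAI’s actual pipeline; the `train` and `collect_feedback` functions here are hypothetical stand-ins.

```python
def train(dataset):
    """Stand-in for model training: the 'model' just records how much data it saw."""
    return {"examples_seen": len(dataset)}

def collect_feedback(model, prompts):
    """Stand-in for users rating responses: thumbs up (+1) or thumbs down (-1).
    In reality humans supply these ratings; here we fake uniformly positive ones."""
    return [{"prompt": p, "response": f"answer to {p}", "rating": 1} for p in prompts]

# Start with a small seed dataset and an initial model.
dataset = [{"prompt": "hello", "response": "hi", "rating": 1}]
model = train(dataset)

# The loop: deploy, collect ratings, fold curated feedback back in, retrain.
for _round in range(3):
    feedback = collect_feedback(model, ["write a poem", "summarize this"])
    # Keep only well-rated examples ("well curated") before retraining.
    dataset += [f for f in feedback if f["rating"] > 0]
    model = train(dataset)

print(model["examples_seen"])  # prints 7: the dataset grows each round
```

The point of the sketch is just the shape of the loop: every pass folds rated interactions back into the training set before the next retrain.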
Now, ChatGPT does absolutely log everything you type into it.
In fact, when you read the terms of service, item number six asks, will you use my conversations for training? Yes, your conversations may be reviewed by our AI trainers to improve our systems. So everything you type into this system is being logged. In fact, there is a specific note in the terms of service saying, hey, don’t put confidential information in here, because this is not the place for it; it is being logged.
So is it learning from your prompts and the text you feed it? Yes, it is. The most overt way you can give feedback to OpenAI is the thumbs up and thumbs down when you are using the service; you’ll see little thumbs up and thumbs down icons right next to its responses. Rating each response, yes, this was a good response, or no, this was not a good response, gives them the training feedback they need to retrain their software.
That said, they can tell a lot about some of the responses by how much refinement there is. Even if you don’t use the thumbs up and thumbs down, if you say, write me a poem in the style of Edgar Allan Poe, but about the 2020 presidential election, and it does its thing, and then you keep asking for refinement after refinement after refinement, that’s a pretty good indicator that the model is not doing what you intended it to do, because it didn’t nail it on the first shot, or the second shot, or the third shot, and so forth. So even in cases where you’re not using that built-in ratings feature, there are plenty of behavioral signals that would indicate, yeah, this thing is not going right.
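A crude version of that behavioral signal is easy to imagine in code: count how many of a user’s turns look like follow-up corrections. The cue phrases and threshold below are purely illustrative guesses for the sketch, not anything OpenAI has disclosed.

```python
# Illustrative cue phrases that suggest the user is asking for a do-over.
REFINEMENT_CUES = ("make it", "try again", "no,", "shorter", "more like")

def looks_unsatisfied(user_turns, max_refinements=2):
    """Flag a conversation where the user kept asking for follow-up tweaks,
    an implicit negative signal even without an explicit thumbs down."""
    refinements = sum(
        1 for turn in user_turns
        if any(cue in turn.lower() for cue in REFINEMENT_CUES)
    )
    return refinements > max_refinements

convo = [
    "Write me a poem in the style of Edgar Allan Poe about the 2020 election",
    "Make it darker",
    "Try again, but rhyme this time",
    "No, shorter and gloomier",
]
print(looks_unsatisfied(convo))  # prints True: three refinement turns in a row
```

A real system would presumably use far richer signals (edit distance between requests, session abandonment, regeneration clicks), but the idea is the same: repeated refinement is feedback, whether or not you ever touch the rating buttons.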
Now, this is purely speculation on my part, but it would completely not surprise me if the outputs and our inputs were basically being added to the training dataset in whole. So when you write a prompt and submit it, that goes into the training data for the next iteration of the model.
So I would absolutely assume that any knowledge we’re creating with the software is being incorporated into that reinforcement learning with human feedback system. Some folks in the AI community would call it active learning, where the model is retraining itself on a regular and frequent basis.
OpenAI seems to release a new version about every month or so.
So my guess is they collect data for a certain period of time, then retrain the model, and roll out the newly retrained model. And obviously, if it goes sideways for some reason, they can just roll back to the previous model. But that’s what I’m pretty sure is going on underneath the hood.
So what research value are they getting from our interactions? They’re rebuilding the model, right? They’re improving the model.
OpenAI makes these really large language models, the GPT series of models, generative pre-trained transformers. They had GPT-2, then GPT-3, now GPT-3.5; later this year, they’re going to be releasing GPT-4, which, despite all the hype online, is just going to be more of the same. It’ll be better at what it does, because it’ll have more data.
And critically, this is the part that I don’t think people understand about these models: our interactions with it provide richer training data than they can get just by scraping the internet itself. If you scrape, say, a Reddit forum thread, yeah, you have some context, but you don’t have that thumbs up, thumbs down behavioral data, as opposed to when we work with the model directly and say, write me a poem about slicing cheese, but in the style of, I don’t know, some author or other, which betrays my lack of studies in English class. Either way, these are very clear pairs of information: prompt, response; prompt, response. And that’s better quality training data for someone who’s building a large language model.
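To make the contrast concrete, here is what the two kinds of data might look like side by side. The field names are assumptions for the sketch, not OpenAI’s actual schema.

```python
import json

# Scraped web text: some context, but no explicit signal about quality.
scraped = {"source": "forum_thread", "text": "here's a poem about cheese..."}

# A direct interaction: a clean prompt/response pair plus an explicit rating.
interaction = {
    "prompt": "Write me a poem about slicing cheese",
    "response": "O wedge of gold upon the board...",
    "rating": 1,  # thumbs up
}

# Paired, rated records like this are trivially serialized (e.g., one JSON
# line per example) and fed straight into supervised fine-tuning.
line = json.dumps(interaction)
record = json.loads(line)
print(record["rating"])  # prints 1: the behavioral signal the scraped text lacks
```

The scraped record tells you what people said; the interaction record tells you what was asked, what was produced, and whether a human judged it good. That last field is the whole difference.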
So that’s what’s happening with the data we’re feeding into this.
We know for sure, from what’s disclosed, that it is being used to retrain the model, and it would not surprise me in the slightest if it was being used to train the next iteration of the big model, GPT-4, with all these conversations, because this thing has taken off like wildfire online. And so thousands, if not millions, of people are freely giving it a lot of information. As a researcher, as someone trying to build software, as someone trying to acquire high-quality data, you couldn’t ask for a better way to do that than to have a bunch of people eagerly running in to provide you with more training data.
So that’s what’s happening. But expect no privacy; it’s in the terms of service: expect no privacy. If you’re putting in, say, the fact that you have a rare medical condition, I don’t know that I would put that into a system like this, which is going to be reviewed in some part by the AI team that builds it.
So good question.
Thanks for asking.
If you liked this video, go ahead and hit that subscribe button.
You might also enjoy:
- Almost Timely News, 17 October 2021: Content Creation Hacks, Vanity Metrics, NFTs
- Best Practices for Public Speaking Pages
- Marketing Data Science: Introduction to Data Blending
- How I Think About NFTs
- Understand the Meaning of Metrics
Want to read more like this from Christopher Penn? Get updates here:
Get your copy of AI For Marketers