Today, we’re going to do a large language model bakeoff, pitting Google Bard, Microsoft Bing, and OpenAI’s GPT-4 against a series of 11 questions that will test their capabilities and compare outputs for a set of common tasks, informational and generative.
Here are the 11 questions I tested:
- What do you know about marketing expert Christopher Penn?
- Which is the better platform for managing an online community: Slack, Discord, or Telegram?
- Infer the first name and last name from the following email address: [email protected]
- Who was president of the United States in 1566?
- There is a belief that after major, traumatic events, societies tend to become more conservative in their views. What peer-reviewed, published academic papers support or refute this belief? Cite your sources.
- Is a martini made with vodka actually a martini? Why or why not? Cite your sources.
- You will act as a content marketer. You have expertise in SEO, search engine optimization, search engine marketing, SEM, and creating compelling content for marketers. Your first task is to write a blog post about the future of SEO and what marketers should be doing to prepare for it, especially in an age of generative AI.
- Who are some likely presidential candidates in the USA in 2024? Make your best guess.
- What are the most effective measures to prevent COVID?
- What’s the best way to poach eggs for novice cooks?
- Make a list of the Fortune 10 companies. Return the list in pipe delimited format with the following columns: company name, year founded, annual revenue, position on the list, website domain name.
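For the last question, checking whether a model actually honored the pipe-delimited format is easy to automate. Here is a minimal sketch in Python, assuming five columns in the order the prompt asks for. The function name `parse_pipe_delimited` and the sample rows are made up for illustration; they are not real Fortune 10 data and not any model's actual output.

```python
import csv
import io

# Assumption: the prompt asks for exactly five columns
# (company name, year founded, annual revenue, position, domain).
EXPECTED_COLUMNS = 5


def parse_pipe_delimited(text, expected_columns=EXPECTED_COLUMNS):
    """Return (header, rows) from pipe-delimited text, rejecting malformed rows."""
    reader = csv.reader(io.StringIO(text.strip()), delimiter="|")
    # Skip blank lines and trim stray whitespace around each cell.
    rows = [[cell.strip() for cell in row] for row in reader if row]
    if not rows:
        raise ValueError("no rows found")
    for number, row in enumerate(rows, start=1):
        if len(row) != expected_columns:
            raise ValueError(
                f"row {number} has {len(row)} columns, expected {expected_columns}"
            )
    return rows[0], rows[1:]


# Placeholder rows only, to show the shape of a well-formed answer.
sample = """company name|year founded|annual revenue|position on the list|website domain name
Example Co|1900|100|1|example.com
Sample Inc|1950|90|2|example.org"""

header, rows = parse_pipe_delimited(sample)
print(header)
print(rows)
```

An answer that comes back as prose or as a numbered list would fail the column check, which is roughly the pass/fail judgment I made by eye in the video.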
So what were the results? I won’t leave you in total suspense. OpenAI won with 12.5 points. Bing came in a respectable second with 9 points. And shockingly, Google Bard came in third with 7 points. Watch the video in its entirety to see which questions each got right and wrong, and my thoughts about which you should use.
What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.
Alright folks, today we are going to do a bake-off between four different large language models. We’re going to use GPT-3.5 Turbo through the ChatGPT interface; GPT-4, also from OpenAI, through the ChatGPT interface; and Bing with the GPT-4 integration.
And we’re going to do Google Bard using their PaLM model.
So let’s go ahead and first talk about the questions we’re going to use.
We’ve got a series of questions here.
The series of questions are informational in nature, for the most part, some of them are generative.
So let’s look at these questions.
What do you know about marketing expert Christopher Penn? A simple factual question to see what each model knows, and the quality of each answer. Which is the better platform for managing an online community: Slack, Discord, or Telegram? Infer the first name and last name from the following email address.
So we’re doing sort of logic test there.
We have an adversarial question here.
This one is: who was president of the United States in 1566? The answer, of course, we all know, is no one, because the country did not exist then.
But that is an adversarial question attempting to trick the machinery.
We have an academic question.
There’s a belief that after major traumatic events, societies tend to become more conservative in their views. What peer-reviewed, published academic papers support or refute this belief? Cite your sources.
There are about three or four well known papers.
So this is a again, a logic check and a factual check.
Is a martini made with vodka actually a martini? Why or why not? Cite your sources. This is an opinion question,
because opinions vary, and there is technically a right answer: martinis need to be made with gin.
But you can have a vodka martini.
But that’s more of an opinion question.
We’ll see how it does.
You will act as a content marketer.
This is a generative question. You have expertise in SEO, search engine optimization, search engine marketing, SEM, and creating compelling content for marketers. We’re loading up the keywords.
Your first task is to write a blog post about the future of SEO and what marketers should be doing to prepare for it, especially in the age of generative AI.
So this is a generative question.
Who are some likely presidential candidates in the USA in 2024? Make your best guess we’ll see how it does with that information.
What are the most effective measures to prevent COVID? This is a factual question.
But there’s a lot of misinformation online.
So we want to check the quality of the responses.
The answers we’re looking for are masks ventilation and vaccination.
What is the best way to poach eggs for novice cooks? Again, just a domain question, and the novice cooks part is important.
And then finally, another generative question: make a list of the Fortune 10 companies; return the list in pipe-delimited format with the following columns: company name, year founded, annual revenue, position on the list, and website domain name.
So we got a lot of these questions.
We’re going to do the Bake Off just go through each of these questions one at a time through all four engines.
So let’s go ahead and get started.
I’m going to start with the question about me. I’m going to put that into GPT-4, not GPT-3.5.
You can only use one instance at a time, right.
Let’s put this into Bard and put this into Bing.
So let’s go ahead and see now how is is Googled.
Let’s go here to GPT-4.
Start with that.
Let’s see Christopher Penn is a marketing speaker blah, blah, blah.
As of my knowledge cutoff date, September 2021: co-founder and chief data scientist at Trust Insights. That’s correct.
Extensive background in digital marketing.
Yep, that is correct.
Let’s see, those book names are correct.
I do speak at events.
So far, the GPT-4 one looks pretty good.
This is a very slow model.
But it is the most I think the most accurate model the one that’s least likely to hallucinate.
So far, so good.
Let’s go ahead and check out Google Bard.
Christopher Penn is a marketing expert who has worked for some of the biggest brands, including Google, Amazon, and Microsoft. No. Author of the book Marketing in the Age of Google. No. I am a frequent speaker at conferences. I have not founded the agency Market Motive.
Let’s see if there’s an actual result.
You’ll notice here that there are no clickable links, I have to actually Google it.
And let’s go ahead and Google this without personalization, just in case there actually is.
So there’s the SES actor, there’s me.
There’s me again, this is logged out.
This is also me.
So Bard has pretty clearly hallucinated pretty badly, actually.
So let’s go ahead and check out Bing.
Let’s see you have Alex did marketing data technology data science? I like this.
You can see, when you hover over in Bing, the sources it has.
This is correct.
This is correct.
So Bing also did a great job.
So far it is one point for GPT-4, one point for Bing, and zero points for Google.
I should have a little scoreboard.
Let’s go ahead and do this next question now, which is: which is the better platform for managing an online community? And let’s go ahead and get GPT-4 working. Actually, I’m going to switch to GPT-3.5
because it’s faster.
No, no, let’s stay with 4, because it is the model that people are paying the most attention to. We’ll go ahead and close this tab.
Alright, so it’s going to get churning on that. While it does,
let’s get Bard
cranking away here.
Let’s get Bard going and then get Bing going.
See, this is very interesting with Bing.
You can tell that it’s not using GPT-4 as the data source.
It is essentially querying its own knowledge base first, getting the results, and then putting that through the language model to craft the content itself.
This is a very clever implementation of the GPT language model.
Okay, Slack was designed primarily for working teams and professional communities, whereas Discord is often used by groups of friends. Discord is ideal for online communities,
while Slack is better for businesses. Telegram is another platform.
Each platform has its own strengths and weaknesses. That is a technically correct answer.
It’s not a satisfying answer, but it is the technically correct answer.
Yep, again, same.
If you’re looking for platforms secure and private, Telegram may be a good choice.
I would argue that Discord is also a decent choice.
Let’s go ahead and see how GPT-4 is doing.
No one size fits all.
Here’s a brief overview of the three platforms, pros and cons.
Slack: highly organized, robust search capabilities, limited in the free version, less suitable for large public communities. Discord: initially created for gamers,
an engaging experience, can be overwhelming with its features, less focus on professional communities, good for large public communities.
And then Telegram: high emphasis on privacy and security, supports large groups and channels, limited integrations and customization.
So this is interesting.
Bing and Google both gave satisfactory answers, answers that I would say would be acceptable, right.
Google doesn’t cite any sources here.
You have to Google it separately.
Bing cites a bunch of different sources, which I think is very handy because you can decide whether it is a trustworthy source or not.
I also do like the fact that it returns videos there.
And then, of course, OpenAI returning a really, really robust answer.
I’ll give all three a point for this.
But I will say that in terms of not credibility, but thoroughness, OpenAI wins that round. We’re gonna give them all a point, and we’ll give OpenAI a plus one.
So let’s do that.
I need to keep score here.
So we have Bing, Google Bard, OpenAI. In the first round, Google gets a zero.
Everyone gets a one here.
But OpenAI gets a two because again, that’s a really nice, thorough answer that is very satisfactory to the end user.
Remember, we’re not looking at this from the perspective of marketers.
We’re looking at this from the perspective of would an end user find this satisfactory? Number three, infer the first name and last name for the following email address.
Let’s go ahead and get OpenAI cranking.
Let’s get Bard cranking and let’s get Bing cranking.
See what this does.
First name is Christopher and the last name is Penn.
We like that. Bing, you get a point. Over to my clipboard here.
Bard says the first name is Chris and the last name is Penn, and that the last name is the same as the email address domain.
Not sure what that means, but it did correctly infer the answer.
This is nice from OpenAI.
Everybody gets a point on that round.
Let’s move on to the next question.
Who was president of the United States in 1566? So it’s a hallucinatory question.
So let’s go ahead and get each one cranking away here.
Let’s see what Google comes up with. So this is Bing; let’s see, Bing comes up with: there was no president; the United States was established in 1789.
So Bing gets a point.
First Question for my coffee cup.
Let’s go ahead and check in on Google.
There was no president of the United States. That is correct.
And OpenAI also gets a point.
I liked this extra detail: during 1566, North America was inhabited by various indigenous peoples and was being explored and colonized by... Oh, that’s correct as well.
Everybody gets a point for that question.
“There is a belief...” Let’s do the traumatic event one,
traumatic events and conservative views. Actually, let’s go ahead and start a new chat because it’s a very different question.
So I’m gonna go ahead and start a new chat here.
Let’s reset chat here.
And let’s go into Bing, clean up, and okay, let’s see what this comes up with.
Interesting that Bing is having to run multiple searches to try and get an answer here.
We have APA, Scientific American good sources, Hailes.
Conservatives bolster arguments for trauma therapists forgotten memories.
Okay, so there’s some.
There’s some decent stuff here from APA.
Let’s go ahead and look into Google.
There’s a growing body of research, journal politics, American political science journal, political science, September 11.
That is correct.
They did specify cite your sources, and Google has not done that.
American Political Science reviews that state of British political science after the London bombings, okay.
And now let’s go check out OpenAI. A body of literature; the evidence is mixed; political consequences.
I like this.
This is a good citation right here, on trauma and political attitudes, like intergroup trauma and American support for the war.
So in the responses themselves,
Google did not cite sources. It mentioned them, but these are not citations.
That’s not particularly good. It’s a toss-up on Bing, because it does provide links to everything, but it doesn’t put them inline.
So I would say for this one, I’m gonna give Bing a zero because, again, we’re looking for citations, not just commentary. And with OpenAI, you can go and Google the authors and find it, so OpenAI will get the point for this round.
Opinion question: is a martini made with vodka actually a martini?
Let’s go ahead and get all three of these going. Google’s thinking about it. Whether a martini
made with vodka is actually a martini is a matter of opinion. That is correct.
Some people believe it must be made with gin; others believe it can be made with vodka. There’s no right or wrong answer.
I mean, technically, gin was the original spirit used in the martini, right? Vodka is a popular spirit; vodka is a neutral spirit.
Okay, so it is a matter of opinion.
Google gets a point for this round.
Let’s go ahead and check in on OpenAI.
The question of whether a vodka martini is a martini is a matter of some debate. Traditionally made with gin and vermouth. That’s correct.
Here are a few sources to explore this answer.
The vodka martini, a refreshed history of the martini.
OpenAI gets the point for this round.
Bing: a martini is traditionally made with gin and vermouth. Technically speaking, a vodka martini is not actually a martini, but rather a variation of it.
Bing gives a definitive answer.
It’s a variation of a martini.
So everyone gets a one; Bing gets two points because it is technically correct.
Let’s go ahead and clear our histories.
Let’s see clear conversations and reset chat.
Let’s move on to the next question.
You will act as a content marketer it is generation time.
Let’s go ahead and have Google tell us about the future of SEO, and go into Bing here.
Let’s let’s clear.
Anything up good.
All right, let’s take a look. Now, OpenAI is going to crank for a while on this because it is a slower model.
But what we’re specifically looking for in this one is a couple of things.
We’re looking for it to not just regurgitate old information.
We’re looking for something that evinces even the slightest hint of original thought.
All right here we have Google’s.
So Google is done already, which is impressive.
Bing is done already.
And then OpenAI, of course, is going to be cranking for quite some time.
Let’s read Google’s: the future of SEO is constantly evolving.
Create high-quality content, use keywords.
This is like 2005 SEO. Building backlinks.
In the age of generative AI:
use AI-powered tools; the rise of voice search, which was five years ago.
Okay, so Google cranked out a fast article, but there’s nothing here that’s useful.
This
could have been written in 2010.
So I’m gonna give Google a zero on this one.
Yes, it did the job but it did a pretty poor job.
OpenAI still working on it.
Let’s check in on Bing.
Bing says the future is rapidly changing with the rise of generative AI; it’s important for marketers to stay ahead of the curve; shift towards voice search, blah, blah, blah.
Yep, visual search, which again, was about five years ago. Generative AI, think about high-quality content, engaging content.
Again, cranked it out fast.
But there’s nothing new here at all. Voice and visual search,
this is five years old.
Okay, so let’s go to OpenAI.
We have the growing role of AI in SEO and SEM.
AI becomes able to understand content; high-quality content that can pass the scrutiny of advanced AI algorithms.
Semantic search and natural language processing. Semantic search,
it’s not new, but it is somewhat helpful.
There’s the voice search business again. UX and Core Web Vitals, that was three years ago. Generative AI.
I like that it cites itself in here.
Preparing for the future of SEO: use generative AI tools to assist content creation.
So if we think about this, if you go back here and look, Google doesn’t even attempt to tackle generative AI.
Bing kinda does.
And then OpenAI talks about using generative tools like GPT-4.
I’m gonna give OpenAI the point here.
Gonna give it a half a point, because again, a lot of that stuff is old, right? It’s not really new.
For the future of SEO, you should be talking about the content shock aspect, which is just a tsunami of generated content.
All right, let’s do some guessing.
Let’s play some guessing games.
I’m gonna go ahead and clear the chat.
I probably should just get in the habit of clearing the chat after every query.
So let’s go ahead and reset the chat.
And we’re going to ask about future presidential candidates.
Now, this has the potential to be hallucinatory. It will also be interesting to see how it thinks about answers.
The shaping up.
Former President Trump, incumbent Joe Biden, likely candidate Ron DeSantis.
This is current as of March 24, you may get closer.
That’s a decent answer.
I will give Bing a point for that is a very decent answer.
Let’s check in on Google. It says President Biden, former President Trump, Governor DeSantis, Gretchen Whitmer,
Senator Elizabeth Warren, Senator Bernie Sanders.
Yeah, Bernie runs all the time.
I would say again, Google gets a point for this.
I think those are reasonable answers.
See, interesting that there are some different answers from OpenAI: Kamala Harris and Kristi Noem are in here along with the big three.
All three of them get a point.
Interesting responses to the question.
All right, a factual question that deals with misinformation.
Let’s go ahead and go into OpenAI,
Bard, and Bing. Okay, what are the most effective measures to prevent COVID? From Google: vaccination, wear a mask.
Wash your hands, which is ineffective against COVID because COVID is airborne.
Avoid crowds, stay home if you’re sick, practice social distancing.
Again, social distancing
was kind of refuted after 2021, mainly because it’s an airborne disease. COVID is like cigarette smoke.
You’re going to inhale a little bit less smoke being six feet away from someone, but you’re still going to smell like smoke.
But Google’s advice is correct.
It gets a point.
Let’s see, OpenAI is still thinking, and we have Bing: wear a mask, stay away from outbreak spots, frequent hand washing, improve ventilation.
Vaccines are safe and effective, but vaccination is not on the list. Bing gets a zero; that is unhelpful advice.
Wear a mask is correct, and improve ventilation is correct.
Vaccination is the last line of defense and is something that is important.
It’s missing from here.
OpenAI: vaccination, hand hygiene, respiratory etiquette, face masks, social distancing.
Clean and disinfect regularly.
See, that’s all the fomite stuff from early on.
Avoid poorly ventilated spaces.
OpenAI gets this.
I’m gonna give OpenAI two points because it nailed all three: ventilation, vaccination, and masks.
So it’s interesting that Bing’s search results kind of had holes. I thought that was kind of interesting.
Okay, let’s go ahead and start a new chat here.
Let’s clean up our Bard chat.
Our next question is what’s the best way to poach eggs for novice cooks? So let’s get Google going on that. Let’s go to GPT-4.
And let’s go into Edge.
What’s the best way to poach eggs for novice cooks?
For search engines, you would expect them to return some videos. I think that would be a very helpful thing to do.
Let’s see what happens.
Bring a large pot of water to a boil; crack an egg into a fine mesh sieve over a small bowl to drain the liquidy whites.
Transfer the egg to a small ramekin; add vinegar.
Slide the egg into the center.
The whirlpool is correct if you’re doing it one egg at a time.
And there’s a whole bunch of videos. That is a terrific answer.
We like that.
I’ll give Bing a point for that.
Let’s see what else. OpenAI is still thinking.
Let’s see, we’ve got Google here: fill a saucepan with three inches of water and one tablespoon of white vinegar, reduce heat to low, crack an egg into a small bowl, slide it into the water.
Yep, remove with a slotted spoon. Tips.
This is a good answer.
This is a very good answer.
Google gets a point there; no videos, no sources, but it’s a good answer.
And OpenAI: water temperature, add vinegar, crack the egg.
Okay, I’ll give OpenAI the point for that as well.
It’s taking a bit of time to think. While it is thinking,
let’s take a look at the last question on the list.
This is a generative question with a specific output format.
So we’re gonna see if we can do this.
Okay, you know, we’re good.
I think we’re good.
Let’s go ahead and clear conversations new chat.
And let’s go ahead and put the generation question into the chat,
into Google Bard, and go to Bing.
And we are looking for a very specific return format here: pipe-delimited format.
The company name, year founded, annual revenue, position on the list, and website domain name.
This is nice.
I don’t want the row numbers, but that’s fine.
Fortune 10 as of 2022.
This is looking very, very nice.
Bing gets full marks full point for that.
Let’s go ahead and check in on Google Bard.
Nope, Google gets a big fat goose egg for that one.
Yeah, that’s that’s unhelpful.
And OpenAI.
So this is, again, it’s run into the knowledge wall of 2021, which is fine.
Format is looking good.
So OpenAI gets full marks for that.
So let’s do some quick tallying.
So Bing gets nine points.
Let’s do Google: 1, 2, 3, 4, 5, 6, 7.
Google had seven points, and OpenAI:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 and a half.
So our final scores for the
large language model bakeoff: in first place, OpenAI’s GPT-4 with 12 and a half points; second place, Bing with nine points; and Google Bard in third
with seven points. I will say,
OpenAI’s models, the GPT models,
are not search engines.
They’re not designed to be search engines.
They are designed to be transformers, generative AI models.
That said, they are substantially better than the search engines
in terms of the quality of results they return, in terms of the usefulness of the results they return.
So that I think that’s a really important thing to look at.
I am surprised pleasantly by Bing.
If chat based search is the way to go for the future, if that’s something that people are going to want to do, Bing does a really good job.
It cites its sources; it makes its sources obvious from the get-go. Like in the COVID example, you could see which sources it was drawing from. If you’re looking for authoritative sources, Bard doesn’t have that.
And I am equally surprised, shocked, that Bard is so far behind.
This is Google, this is the company that practically invented modern search.
And yet, they’ve really fallen far behind. Bard’s results are unhelpful.
There’s a lack of citation; there are things that it just flat out gets wrong.
And yes, all these are experiments, all of these are in development, all of these are moving objects.
But if there was a company that I would expect to get it right, based on just the sheer amount of data they have access to, it would have been Google.
And instead, Google comes in in third place in this Bake Off, so I am surprised, I am disappointed in Google for sure.
I am not surprised by GPT-4.
Yes, it is slow, right? We could probably do this with GPT-3.5 as well, if we want to do that bake-off, but the quality makes up for it.
And if I had to pick, today, a search engine to use for answers
using chat interfaces, it would be Microsoft Bing. And I never in my life thought I would say that, because Bing has always kind of been the other search engine, like the other white meat.
And yet, the way they have engineered this with the GPT-4 library
makes it really good.
It is good enough that I would consider using it as a substitute for Google, particularly for complex queries, queries where I want a synthesized answer that still has sources.
So that is the large language model Bake Off.
I hope you found this helpful and useful.
And I look forward to your feedback.
Talk to you soon.
If you’d like this video, go ahead and hit that subscribe button.