I am bullish on AI creating content at scale.
I am bearish on AI creating GREAT content at scale – or at all.
Why? It comes down to limits of training data, fundamentally.
All machine learning models, from the most basic to the most sophisticated, need something to learn from. In the case of language generation – automated creation of content – they need tons of examples to learn from.
And therein lies the problem.
Before we go farther, let’s define great content as content that’s well-written with a unique point of view and a distinct voice. That part is important.
Content Quality in the World
When it comes to publicly available content, there are two potential distributions, a power law distribution and a normal distribution.
A power law distribution looks like this:
This is also known as an 80/20 rule or a 95/5 rule; fundamentally, the amount of poor quality content dwarfs everything else. The amount of great quality content is on the right hand side – and it’s very small.
A normal distribution looks like this:
In this case, it says there’s a small pool of absolutely terrible content, a massive amount of mediocre content, and a small pool of absolutely great content.
Whichever distribution we think represents reality, there’s very little great content compared to everything else – which means machines have very little great content to learn from.
And if there’s an insufficient amount to learn from, then the machines will not be able to synthesize great new content. They will be able to synthesize mediocre content or poor content.
How Much Content to Train?
How much content are we talking about in order to train a natural language model? The Pile, an 800 GB dataset created by Eleuther.ai, is a training dataset composed of 22 libraries:
The largest item in The Pile is the CC, the Common Crawl, derived from the Wayback Machine and Archive.org. That means it’s ingesting a huge amount of web text from all over the web – and of substantially varying quality. OpenWebText2, according to the documentation, is another scraping of web content based on Reddit upvotes.
All this indicates the level of quality of the training data. The folks who assembled this training dataset, like the other major natural language models, have done their best to filter out the bottom of the barrel, the absolute garbage that would do more harm to a natural language model than good. So we can be fairly confident in a normal distribution in terms of content quality; after all, YouTube subtitles, US patents, and medical papers are important documents but not exactly riveting reading most of the time.
What isn’t obvious from the table above is just how little data we have for a specific example. The Common Crawl is 227 GB of data, with an average document size of 4 KB. What that works out to is a dataset of 56.7 MILLION pages. That’s how many web pages are needed in just a portion of the training set.
The overall dataset is 825.18 GB, with an average document size of 5.91 KB. That’s 139 MILLION pages of text. That’s what is needed to construct a language model.
Now, let’s say for giggles that you think your website is the bee’s knees, that your content is A+ content. Let’s use Social Media Examiner, a well-regarded blog, as an example. How many pages are on this site? About 62,000 per the AHREFS SEO tool. When you think about it, that’s a tiny amount of text. Even the Reddit-filtered OpenWebText2 portion of the pile is 16 million pages.
In terms of a unique voice, let’s use this blog. I’ve been blogging here consistently since 2007. How many pages of content have I managed in that time? 8,954. That’s not a ton of content in the big picture compared to what is used to train machine learning models.
Which means that natural language models will inherently be biased towards creating mediocre content, content that’s readable and coherent, but not compelling or unique, because that’s what the vast majority of the language is that they are trained on.
Let’s illustrate the limits of language models trained on a diet of mostly mediocre data. First, let’s take a mediocre use case – writing a press release. I’ll feed the GPT-J-6B natural language generation model the following text from a press release:
NEW YORK–(BUSINESS WIRE)–Today, Effectv, the advertising sales division of Comcast Cable, released the latest edition of its TV Viewership Report. The report explores viewership trends for the first half of 2021 based on Comcast aggregate household viewership data, as well as insights based on Effectv’s campaign ad exposure data for both traditional TV and streaming content.
According to an analysis of more than 20,000 Effectv cross-platform campaigns, streaming consistently served as a strong complement to traditional TV campaigns in the first half of 2021. In fact, 57% of reach from streaming in those campaigns was incremental to linear TV campaigns. Additionally, streaming impressions were +209% more likely to be served within households that viewed little or no traditional TV. This proves that complementing traditional TV strategies with premium long-form streaming will grow incremental reach and increase exposure with households less likely to be tuning in the traditional way.
This is only half the press release; I’ve highlighted in yellow the part I’m feeding the AI:
I want the AI to autocomplete the rest of it. Let’s see how it did:
“Our data shows that traditional TV and streaming are complementary platforms, which is why we’ve invested so heavily in both,” said Mark O’Neill, senior vice president, Ad Sales. “We’re seeing strong results with streaming, and we look forward to sharing more insights at the upcoming IBC in September.”
TV Viewership Trend Analysis
With the recent U.S. restrictions on travel and gatherings, viewership of national, regional and local channels has been declining in most markets. For those channels that are continuing to deliver, the decline is expected to be between -7% to -19% in 2021.
For the first half of 2021, total viewership of all channels declined by +3% from the first half of 2020. Viewership of NBC-owned networks (NBC, Telemundo and CNBC) declined by -15% to -11%. Viewership of Fox-owned networks (Fox, FX, FS1 and FS2) declined by -7% to -10%. Viewership of ABC-owned networks (ABC, Freeform and ESPN) declined by -7% to -9%. Viewership of CBS-owned networks (CBS and CW) declined by -4% to -9%. Viewership of the CW declined by -6% to -4%. Viewership of Univision networks (Univision, UniMas and Galavisión) declined by -3% to -2%. Viewership of PBS declined by -3% to -2%. Viewership of independent channels declined by -6% to -13%.
In total, the largest decline in viewership among the national and local channels was among NBC, ABC, CBS and CW networks. The Fox network showed the second-largest decline, followed by Univision and PBS. These results are consistent with the recent ad spending trends on these channels.
With the growth in streaming services such as Netflix, Amazon Prime Video and Hulu, streaming impressions have grown at a rate of +11% to +21% year-over-year. Additionally, impressions of streaming campaigns were +21% more likely to be served in households that viewed little or no traditional TV, which is a reflection of the strong viewing patterns of streaming. In contrast to streaming impressions, impressions of traditional TV campaigns showed a decline of -3% to -4%.
What we see is that for a mediocre piece of content – a bland press release – the AI did astonishingly well. The data is likely not factually correct, but it captured the tone quite well. This would, at the least, be a great first draft that an editor could tune up and be ready to publish in short order.
Now, let’s try it with something more unique and distinct. I subscribe to and enjoy Tom Webster’s weekly newsletter, I Hear Things. It’s a fantastic newsletter about all things audio, so if you care about stuff like radio, audio, audio ads, podcasting, social audio, etc., you should subscribe to it. One of the things that’s distinctive about Tom’s newsletter is his written style. It’s very unique. Here’s a sample of a passage from the most recent issue:
This can be a difficult statement to fill in for the average podcaster. What is the thing that only your podcast provides for its intended audience? Not being able to formulate this kind of statement for your show is problematic not only as a hinderance to promote your show, but also as a trailing variable for a deeper problem: an inherent lack of differentiation with the show itself.
It’s definitely a thing I thought about first when I decided to create a podcast version of this newsletter. I started the I Hear Things podcast initially as a service for those who told me they sometimes didn’t have time to read my newsletter, but could always make time to listen to it. Thus, the podcast was born, but “The Podcast For People Who Like To Read I Hear Things But Don’t Have Time” is not the most compelling pitch I’ve ever heard. It probably ranks right ahead of “The Podcast That Won’t Give You Leprosy.” This is both of those things.
But there was an ulterior motive to the podcast.
As before, I’ll highlight the part I’m feeding to the machine:
Now, let’s see how well the state of the art natural language generation model does, a model trained on all the data I shared earlier:
But there was an ulterior motive to the podcast. When I first started listening to podcasts, I found most of them to be like a car commercial; they all talk about the same things and the people that they’re talking to don’t really give a shit. I’m not sure why, but this seems to be a common pattern. You’ll be watching a movie and it’ll be full of some bullshit talking heads that go on about some movie you’re about to watch. You’ll have the same thing when you’re listening to an audio book. Or, you’ll be at the gym listening to some exercise podcast that goes on and on and on about the best workout routines and all of that stuff. I was sick of that, so I made I Hear Things.
And when I tell people about I Hear Things, I always get an equally strong response that it’s about the opposite. It’s not all about what we’re gonna talk about, it’s about what we’re gonna NOT talk about. I Hear Things is all about different people from different walks of life, so when I was doing it, I had no idea that I would discover the show that I Hear Things eventually became. But in the end, it has. It has a unique tone, and I know what I’m getting with this show. I want you to do the same. If you make a podcast about the things that you’re interested in, chances are you can find some people who are like minded and who are also interested in the things that you’re interested in. If that’s you, then the I Hear Things podcast might be for you. And I want to hear from you, the listener. So if you’re interested in what the I Hear Things show is about, give us a call at 1-800-917-3116.
What the machinery came up with was coherent and readable, but it was not at all close to what Tom wrote. It also changed voice – Tom’s tone and use of language is very specific. It’s conversational to a degree, but with a sense of formality, a sense of poise as though each word was chosen carefully. What the machine spit out is… not that. It comes across more like what a Joe Rogan or a Howard Stern might write.
Because Tom’s writing style is so distinct, it would be very difficult to build a model around it. It would even be tricky, because he’s written comparatively little, to fine-tune a model around it – and doing so would be at a very high computational cost. It can be done, but it’s not something the average marketer is going to have the time or talent to dol
This is why I’m bullish on AI for general, mediocre content creation at scale. It’ll do a great job of it, and be affordable for everyone. But this is also why I’m bearish about AI creating something truly great, because greatness is scarce. The more scarce, the harder it is for anyone – man or machine – to replicate it.
One final bit of food for thought: if you feed your own writing into a machine learning model and what comes out is equal to or better than your writing, that’s probably a sign that you need to level up your writing. Your writing sits in the middle of the bell curve, and for both personal and professional reasons, it needs to move to the outer edge of excellence.
You might also enjoy:
- The Biggest Mistake in Marketing Data
- You Ask, I Answer: Google Tag Manager and Google Analytics Integration?
- Marketing Data Science: Introduction to Data Blending
- It's Okay to Not Be Okay Right Now
- B2B Email Marketers: Stop Blocking Personal Emails
Want to read more like this from Christopher Penn? Get updates here:
Get your copy of AI For Marketers