In today’s episode, I discuss why AI struggles with sarcasm and tone. I explain how statistical models can’t detect subtle inflections that change meaning. While today’s text-based AI misses nuance, future multimodal systems will interpret tone and context better. Tune in to learn why your AI tools may misunderstand certain inputs.
Can’t see anything? Watch it on YouTube here.
Listen to the audio here:
- Got a question for You Ask, I'll Answer? Submit it here!
- Subscribe to my weekly newsletter for more useful marketing tips.
- Subscribe to Inbox Insights, the Trust Insights newsletter for weekly fresh takes and data.
- Find older episodes of You Ask, I Answer on my YouTube channel.
- Need help with your company's data and analytics? Let me know!
- Join my free Slack group for marketers interested in analytics!
What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.
In today’s episode, let’s talk about why generative AI struggles with certain types of language, like sarcasm.
The problem is statistical.
The problem is mathematical and the problem is multimodal communication.
So let’s talk about what this means.
Any sentence in North American English, for example, can be dramatically changed by intonation, even though English is not a tonal language, meaning the words don’t change their meaning because of the way you pronounce them.
In languages like Chinese, by contrast, intonation is very, very important.
If you use the wrong tone, you might mean to say mother (mā) and end up saying horse (mǎ).
English has a few words like that, but not very many. For the most part, there’s a strict semantic meaning to the words we say. I could say mother and horse.
And they’re distinct, right? No matter how much I change the pronunciation of those words, they still pretty much mean the same thing.
There are exceptions, of course.
So in languages where words have very strict semantic meaning and intonation doesn’t change it a whole lot, machines have fairly good statistical distributions, right? They can understand that if you say, “I pledge allegiance to the…”, the next word is probably going to be “flag.”
If I say, “God save the…”, the next word is probably going to be either King or Queen; it’s unlikely to be rutabaga, right? However, a lot of the meaning that comes out of language is still based in tone, not because of semantics, but because of the literal sound we make with a sentence.
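That kind of next-word prediction can be sketched with a toy bigram model. The miniature corpus below is invented purely for illustration; a real language model learns weights over vastly longer contexts, but the principle of putting probability mass on statistically likely continuations is the same.

```python
from collections import Counter, defaultdict

# A tiny invented corpus, lowercased and pre-tokenized for simplicity.
corpus = (
    "i pledge allegiance to the flag . "
    "god save the king . god save the queen . "
    "the flag waves . the king waves ."
).split()

# Count how often each word follows each preceding word (bigrams).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word_probs(word):
    """Probability of each word appearing immediately after `word`."""
    counts = following[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("the"))
# Only words actually observed after "the" get probability mass;
# "rutabaga" never follows it in this corpus, so its probability is zero.
```
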
For example, let’s say I say, “I really liked that pizza.”
I don’t know why I keep going back to pizza.
If I say I really like that pizza, that’s a fairly neutral sentence, right? It’s a fairly neutral tone.
If you’re a speaker of North American English, you can pretty much take it at face value that I liked the pizza.
If I say it sarcastically, “I really liked that pizza,” the words on paper are exactly the same. A machine would see them the same way; the statistical distribution is identical.
But the intonation is different.
The intonation communicates some of that sarcasm, right? That says, Yeah, I actually didn’t like that pizza.
But a machine, today’s text-based large language model, can’t hear; it can’t hear me say that.
And as a result, they don’t understand that I’m actually negating the meaning of the text itself.
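That blindness is easy to demonstrate: any text-only representation, whether tokens, word counts, or embeddings of the string, comes out identical for the sincere and the sarcastic delivery. A minimal sketch:

```python
from collections import Counter

sincere = "I really liked that pizza"    # delivered warmly
sarcastic = "I really liked that pizza"  # delivered with an eye-roll

# Tokenize the way a simple text pipeline might.
tokens_sincere = sincere.lower().split()
tokens_sarcastic = sarcastic.lower().split()

print(tokens_sincere == tokens_sarcastic)                    # True: same tokens
print(Counter(tokens_sincere) == Counter(tokens_sarcastic))  # True: same word distribution
```

The intonation that flipped the meaning lives entirely in the audio channel, which never reaches the text pipeline.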
Right? Think about if you’re joking around with a friend and you do something, and that friend just goes, Oh, my God, I hate you.
Right? They don’t actually hate you.
Hope not anyway.
But the tone in which that’s delivered is enough for you to know they’re kidding around. Compare that to somebody just shouting at someone:
Oh, my God, I hate you.
Right? That is very different.
That communicates something much closer to the literal meaning.
And so this is the challenge that generative AI faces today as a text-based medium.
Text is code, right? Text is programming code.
We program each other with language, and when we’re communicating purely in writing, we have to do a lot of language tricks to communicate those tones, because they’re not apparent otherwise.
If you read people’s text messages, or messages in Discord or Slack, half the usage of things like emoji is to communicate tone in a way that you can’t with text alone.
If you read really well-written fiction, there’s a lot of description and a lot of context to help you understand how a character is saying something.
And even then, it can still be very ambiguous, right? Watch a screen interpretation of a text, take The Lord of the Rings, for example: the way Tolkien wrote it is not necessarily what ends up on screen.
The people bringing a source text to the screen have to make a lot of interpretations, editorial choices that say, this is what the author meant.
And that may or may not be the case, right? When movies like The Lord of the Rings were produced, Tolkien had long since passed away.
So there was no way to go back to him and ask, is this actually what you meant in this text? Now, again, with skillful writing you can communicate some of that tone, some of that context, some of the things that would indicate sarcasm. Going back to the pizza example, you might write, “Oh, I really loved that pizza,” he said with a smirk, or, he said, rolling his eyes. We have to provide that extra description in text to communicate those non-verbals.
But if we’re processing transcripts or any other spoken word where tone carries meaning, our machines are going to go awry, right? They’re not going to interpret it well right now.
Now, here’s the thing that’s going to change.
It is already starting to change, because language models are becoming multimodal. You have models like LLaVA or GPT-4V that can see and read, right? They can take a text input and a visual input and mix the two.
It is not a stretch of the imagination to combine a text model with an audio model, so that a machine can listen to the intonation and understand the difference between “I hate you” and “I hate you,” right? Same words, same statistical distributions, but very different meanings based on intonation.
If you’re running into cases where you’re not getting the results you want out of a language model, especially when you’re generating written text, consider how much non-verbal communication goes into the kind of writing you’re doing.
You may have to prompt it to fill in some context that isn’t necessarily there.
Even if you’re using it in a marketing or business sense, remember that marketing and business are still human communication. There’s still a lot of that nuance, a lot of non-text communication, so if you’re not getting the model to do what you want, you may need to pull some tricks out of fiction writing to make the models work better.
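One fiction-writing trick is simply to spell out the non-verbals in the prompt, the way a novelist tags dialogue. A hypothetical sketch; the `add_tone_context` helper and its wording are my own illustration, not a prescribed prompt format:

```python
def add_tone_context(utterance: str, tone: str, cue: str) -> str:
    """Wrap a bare utterance with the non-verbal context a text-only
    model would otherwise have to guess at."""
    return f'"{utterance}," he said, {cue}. (Intended tone: {tone}.)'

bare = "I really loved that pizza"
sincere = add_tone_context(bare, "sincere", "smiling")
sarcastic = add_tone_context(bare, "sarcastic", "rolling his eyes")

print(sincere)
print(sarcastic)
# The raw utterance is identical in both; only the added description
# tells a reader, or a model, which meaning is intended.
```
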
Something to think about as you’re trying these things.
But that’s one of the reasons why today generative AI struggles with sarcasm, and why in the future, it may struggle much less.
Thanks for tuning in.
Talk to you next time.
If you enjoyed this video, please hit the like button.
Subscribe to my channel if you haven’t already.
And if you want to know when new videos are available, hit the bell button to be notified as soon as new content is live.
You might also enjoy:
- The Basic Truth of Mental Health
- B2B Email Marketers: Stop Blocking Personal Emails
- It's Okay to Not Be Okay Right Now
- Understand the Meaning of Metrics
- What Content Marketing Analytics Really Measures
Want to read more like this from Christopher Penn? Get updates here:
Get your copy of AI For Marketers