You Ask, I Answer: Multilingual Outputs in ChatGPT?

Warning: this content is older than 365 days. It may be out of date and no longer relevant.

You Ask, I Answer: Multilingual Outputs in ChatGPT?

Xiaoli asks, “will the GPT output result differ for different languages? for example, will the GPT result in English better than the result in Chinese?”

In this episode, I discuss whether the GPT output differs for different languages. The majority of machine learning is biased towards the English language, which has become the lingua franca of the modern technology world. Translation models and the GPT family of models do not do as great a job going from English to other languages as it does from other languages to English. It varies by language, and the smaller the language’s content footprint, the worse the models perform. However, over time, expect models specific to a language to get better as they ingest more content and understand more of what is published online. Watch the video to learn more about language biases in machine learning and artificial intelligence.

This summary generated by AI.

You Ask, I Answer: Multilingual Outputs in ChatGPT?

Watch this video on YouTube.

Can’t see anything? Watch it on YouTube here.

Listen to the audio here:

Download the MP3 audio here.

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

Christopher Penn 0:00

In today’s episode Xiao Li asks, Will the GPT output differ for different languages? For example, will the GPT result in English be better than the resulting Chinese? Yep, the majority of machine learning a substantial amount of machine learning and artificial intelligence is very, very heavily biased towards the English language.

English has become sort of the ironically the lingua franca of the modern technology world, right? Where a lot of work is being done in English code is written and documented in English, many of the major open source projects are tend to be English first.

So it stands to reason that the amount of content online that was scraped to put together these models is biased towards English as well.

And we see this to be we know this to be true.

And you’ll look at translation models and how the GPT family of models translates.

It doesn’t do as great a job going from English to other languages as it does from other languages to English, test it out for yourself, find some friends who speak multiple languages, and do some bilateral testing have the GPT model translate something from another language into English and have it translate from English to another language and see which one comes up with a better output.

And it varies by language.

It is not consistent, right? It is not the same percentage of not as good with say like Chinese where there’s a ton of information as there is with language like Swahili, or Tibetan.

The smaller languages content footprint is the worst the models do add it.

Particularly when you look at stuff that is stored and things like academic papers, which is where a lot of more obscure languages come from.

The GPT series of models, for example, has, it can do Sumerian, it can’t do a Syrian, it can’t do Babylonian, even though these are known languages, and it struggles with smaller dialects.

So it won’t do as good a job with Koine Greek as it will with modern Greek.

Ultimately, though, there’s a very heavy bias towards English.

Even I think it’s something like only 20% of the world.

English is is the the major language there.

Most of the world does have some level of capability in English in some fashion for a lot of the the bigger economy nations.

But obviously English is not the first language in those places.

But English has dominance right now in technology because of the nature of technology where a lot of tech industries got started, will that change? Probably.

I mean, China itself is cranking out huge numbers of AI scientists and stuff and I would hold the expect really good large language models to be built in Chinese First, I would expect the same to be true for Hindi and Urdu, right? country of India has 1,000,000,003 1.3 billion people or something along those lines.

Just on numbers alone, they will crank out probably more AI specialists than say, a country like the USA which has only 330 million people, it’s just a numbers game.

So over time, expect those models to get better expect models that are a specific language first, but also expect the GPT series and the big public models to get better as well as they ingest more content as they as they understand more of what is published online.

Good question.

If you’d like this video, go ahead and hit that subscribe button.

Machine-Generated Transcript

Christopher Penn 0:00

Comments

Leave a Reply Cancel reply

Pin It on Pinterest