## What’s On My Mind: How Large Language Models Work – A New Explanation

I’ve been working on a more thorough way of explaining how large language models do what they do. Previously, I had explained how large amounts of text were digested down into statistical representations, and while this is accurate, it’s hard for people to visualize. So let’s tackle this in a new way, with word clouds. Now, to be clear, this is a vast oversimplification of the mathematics behind language models. If you enjoy calculus and linear algebra and want to dig into the actual mechanics and mathematics of large language models, I recommend reading the academic paper that started it all, “Attention Is All You Need” by Vaswani et al.

Take any word, and there are words associated with it. For example, if I give you the word marketing, what other words related to it come to mind? Digital marketing, content marketing, email marketing, marketing strategy, marketing plans, marketing template, sales and marketing – the list goes on and on, but there are plenty of words that are associated with the word marketing. Imagine that word, marketing, and the words associated with it as a word cloud. The words that occur the most around marketing are bigger in the cloud. Got it?
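If it helps to see that idea in code, here is a minimal sketch of building one word cloud by counting the words that appear near a target word. The `word_cloud` function, the `window` size, and the four-line corpus are all made up for illustration; real models learn from billions of pages and do not store literal counts like this.

```python
from collections import Counter

def word_cloud(target, sentences, window=2):
    """Count the words that appear within `window` positions of `target`."""
    cloud = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            if word == target:
                lo = max(0, i - window)
                # Neighbors on both sides, excluding the target itself.
                neighbors = words[lo:i] + words[i + 1:i + 1 + window]
                cloud.update(neighbors)
    return cloud

# Hypothetical toy corpus, standing in for a huge pile of text.
corpus = [
    "digital marketing strategy",
    "email marketing strategy",
    "content marketing plans",
    "sales and marketing strategy",
]

cloud = word_cloud("marketing", corpus)
print(cloud.most_common(1))  # [('strategy', 3)]
```

The higher the count, the bigger the word would be drawn in the cloud.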

Next, let’s take a different word, a word like B2B. When you think of words associated with B2B, what words come to mind? B2B marketing, sure. B2B sales, B2B commerce, B2B strategy, and so on and so forth. Again, picture that word and all its associated words as a word cloud and again, the words that occur the most around B2B are bigger in the word cloud.

Now, imagine those two clouds next to each other. What words do they have in common? How much do they overlap and intersect? B2B and marketing share common words in each other’s clouds like sales, commerce, strategy, etc. Those words have an increased probability when you mash the two clouds together, so you could imagine those words would get even bigger.
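Here is a rough sketch of mashing two clouds together. The counts are invented, and multiplying them is just one simple assumption for making shared words "get even bigger"; it is not how a real model combines anything.

```python
from collections import Counter

def mash(cloud_a, cloud_b):
    """Keep only the words both clouds share, scored by the product of their counts."""
    shared = cloud_a.keys() & cloud_b.keys()
    return Counter({word: cloud_a[word] * cloud_b[word] for word in shared})

# Hypothetical counts for illustration only.
marketing = Counter({"strategy": 5, "sales": 4, "email": 6, "content": 3})
b2b = Counter({"strategy": 4, "sales": 6, "commerce": 3})

combined = mash(marketing, b2b)
print(combined.most_common())  # [('sales', 24), ('strategy', 20)]
```

Words that live in only one cloud, like email or commerce, drop out of the overlap entirely.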

And that’s the start of how large language models do what they do. Large language models essentially are composed of massive numbers of word clouds for every word they’ve seen, and the words associated with those words. Unlike the toy example we just walked through, each individual word’s cloud in these models is composed of tens or hundreds of thousands of additional words. In the largest models, like GPT-4, there might even be millions of associations for any given word, and those associations also occur among words, phrases, sentences, and even entire documents.

For example, there will be multiple associations for a word – apple could refer to a fruit or a computer company, and the words around apple determine which association will be used. Each of these clusters of association exists inside a large language model as well, which is how it knows to mention Steve Jobs if your prompt contains both apple and computer along with other related words, even if you don’t mention Steve Jobs by name.
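A toy version of that apple example might look like the sketch below. The `SENSES` clusters and the `pick_sense` function are hypothetical, hand-written stand-ins; a real model learns these associations from text rather than from a lookup table.

```python
# Hypothetical clusters of association for the two senses of "apple".
SENSES = {
    "fruit": {"pie", "orchard", "juice", "tree", "eat"},
    "company": {"computer", "iphone", "mac", "steve", "jobs"},
}

def pick_sense(prompt):
    """Pick the sense whose cluster shares the most words with the prompt."""
    words = set(prompt.lower().split())
    scores = {sense: len(words & cluster) for sense, cluster in SENSES.items()}
    # On a tie, max() returns the first sense listed in SENSES.
    return max(scores, key=scores.get)

print(pick_sense("apple computer founder"))      # company
print(pick_sense("apple pie from the orchard"))  # fruit
```

Once the company cluster wins, words like steve and jobs are already sitting in that cluster, which is the toy equivalent of the model mentioning Steve Jobs without being asked.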

When you use a tool like LM Studio or ChatGPT or Google Bard, and you give it a prompt, it goes into its library of word clouds and takes each word from your prompt, extracts the relevant word cloud associations, mashes them all together, and the intersections of all those words are essentially what it spits out as its answer, formatted in the language of your choice. This is why these tools are so effective and so powerful – they have a knowledge of language based on how a word relates to every other word that appears near it in millions of pages of text.

This is also what makes the difference between good prompts and bad prompts, between non-creative and creative responses. Think about it for a second. If you write a short, boring prompt, it’s going to create a mash of word clouds that is relatively small, and that means only the most frequent (and therefore boring and non-creative) words will be returned. “Write a blog post about the benefits of email marketing” is going to generate some really mediocre, boring content because it’s a mediocre, boring prompt that will return high-level word cloud mashups only. True, there will still be hundreds of thousands of words in the combined cloud of a prompt that small, but because we’re thinking about the INTERSECTIONS of those clouds, where they overlap, you’re not going to get much variety or creativity.

If you used a prompt like “You are a MarketingProfs B2B Forum award-winning blogger who writes about B2B marketing and email marketing for the industrial concrete industry. Your first task is to draft a blog post about the benefits of a high-frequency email marketing program for an industrial concrete company that sells to state and local governments; focus on unique aspects of marketing the concrete industry and heavy construction. You know CASL, CAN-SPAM, and GDPR. You know email marketing best practices, especially for nurture campaigns in marketing automation systems. Write in a warm, professional tone of voice. Avoid tropes, jargon, and business language. Avoid adverbs.” How many of these word clouds will be created with a prompt this large? Many, many word clouds, and each cloud of associations will have overlaps with the others. The net result is you’ll get a much more tailored, unique, and creative result.
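To make the difference between those two prompts concrete, here is a small sketch in the same word cloud spirit. The `CLOUDS` table and the `overlap` function are invented for illustration; the point is only that every extra specific word in the prompt changes which associations survive the overlap.

```python
# Hypothetical word clouds, each reduced to a handful of associated words.
CLOUDS = {
    "email": {"newsletter", "open-rate", "nurture", "compliance", "subject-line"},
    "marketing": {"newsletter", "nurture", "compliance", "brand", "funnel", "subject-line"},
    "concrete": {"curing", "rebar", "bids", "compliance", "nurture"},
    "government": {"bids", "procurement", "compliance", "nurture"},
}

def overlap(prompt_words):
    """Return the words shared by the clouds of every known prompt word."""
    clouds = [CLOUDS[word] for word in prompt_words if word in CLOUDS]
    return set.intersection(*clouds) if clouds else set()

generic = overlap(["email", "marketing"])
specific = overlap(["email", "marketing", "concrete", "government"])

print(sorted(generic))   # ['compliance', 'newsletter', 'nurture', 'subject-line']
print(sorted(specific))  # ['compliance', 'nurture']
```

The generic prompt lands on the words any email marketing post would use. The detailed prompt steers toward the associations that matter for a concrete company selling to governments.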

When you understand conceptually what’s going on under the hood of large language models, it becomes easier to understand how to use them to the best of their capabilities – and why non-language tasks simply fail most of the time. For example, math is really hard for many models to get right because they fundamentally don’t do computation. They’re predicting the likelihood of characters – numbers – and the numbers that should be nearby. That’s why earlier models had no trouble with expressions like 2 + 2 = 4 but could not do 22 + 7 = 29. The former equation occurs much more frequently in written text, while the latter is fairly rare by comparison. The model isn’t performing any calculations, and thus tends to get the answer wrong.
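Here is a deliberately crude sketch of what "predicting instead of calculating" looks like. The `seen` list is made-up training text and `predict` is a hypothetical lookup, far simpler than anything inside a real model, but it fails in the same spirit: it can only repeat what it has seen often.

```python
from collections import Counter, defaultdict

# Hypothetical training text: common equations appear often, rare ones never.
seen = ["2 + 2 = 4"] * 50 + ["2 + 2 = 5"] * 2 + ["1 + 1 = 2"] * 30

# For each left-hand side, count which answers followed it in the text.
completions = defaultdict(Counter)
for line in seen:
    left, right = line.split(" = ")
    completions[left][right] += 1

def predict(expression):
    """Return the most frequently seen answer, or None if never seen. No math is done."""
    counts = completions.get(expression)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(predict("2 + 2"))   # 4
print(predict("22 + 7"))  # None
```

The lookup gets 2 + 2 right only because the right answer outnumbers the wrong one in the text, and it has nothing to say about 22 + 7 because it never saw it.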

This is also why censorship is so damaging to the structure of these models. Take any common profane word, like the venerable F word. How often do we use it? How many other words are associated with it? If you were to try ripping it out of a combination of word clouds, how many other words might get ripped out too – and are they useful words otherwise?

That’s also why models behave less or more creatively. They’re not intrinsically creative; they’re simply clouds of probabilities being mashed together. When you give a non-creative prompt, you invoke only the broadest probabilities, and you get a non-creative result. When you give a highly creative, relatively rare prompt that has many combinations of many specific words, you invoke very specific probabilities and get more creative results.

Large language models are libraries of probability, and every time we use them, we are invoking probabilities based on the words in our prompts. If we aren’t getting the results we want, we should examine the words, phrases, and sentences in our prompts and adjust them to add more detail until we get what we want. There’s no magic formula or secret guide to prompt engineering, no “Instant Success with ChatGPT” that has any serious credibility. If you have conversations with these models that use the appropriate language to get all the word clouds to overlap well, you’ll get what you want from a large language model.

