Almost Timely News: Why Mistral’s Mixture of Experts is Such a Big Deal (2023-12-24) :: View in Browser

Almost Timely News

πŸ‘‰ Register for my new Generative AI for Marketers course! Use ALMOSTTIMELY for $50 off the tuition

Content Authenticity Statement

100% of this week’s newsletter was generated by me, the human. When I use AI, I will disclose it prominently. Learn why this kind of disclosure is important.

Watch This Newsletter On YouTube πŸ“Ί

Click here for the video version of this newsletter on YouTube

Click here for the video πŸ“Ί version of this newsletter on YouTube Β»

Click here for an MP3 audio 🎧 only version »

What’s On My Mind: Why Mistral’s Mixture of Experts is Such a Big Deal

About two weeks ago, at the beginning of December 2023, the French AI company Mistral released a new model called Mixtral, which is a sort of neologism for Mistral Mixture of Experts. This made a well-deserved, huge splash in the AI community, but for those outside the tech nerd community, there might be some head scratching about why it’s a big deal.

So let’s walk through what this thing is, why it matters, and how you might or might not make use of it. First, Mixtral is a sparse mixture of experts language model. There’s a lot to unpack in just that sentence alone.

A mixture of experts model is when you take a language model, and within the inner workings, instead of having one model making decisions and generating outputs, you have several. The concept isn’t new; it was first conceived back in 1991 by Jacobs et. al. in a paper called Adaptive Mixtures of Local Experts.

So how is this different? When you use a system with a monolithic model, like ChatGPT with the free GPT-3.5-Turbo model (it’s rumored GPT-4’s current incarnations are also ensembles of models and not just one big model), your prompt goes into the system, the model makes it predictions, and it spits out its answer. The model has to be good at everything, and nothing within the model is checked for accuracy. To the extent that a language model has any checking, it’s done at the tuning phase where the model is taught how to answer questions.

In a mixture of experts model, instead of one big monolithic model, there’s an ensemble of different models within it. Your prompt gets parsed and then different tasks within the model are assigned. The component parts do their work, and then the results are assembled.

Here’s a familiar analogy. Think of a monolithic model as a really strong, really skilled chef. They get an order for a pizza, and they get to work, making the dough, mixing the sauce, preparing the toppings, getting the pizza into the oven, and boxing it up. The entire process is done by one person, and they have to be skilled at everything from beginning to end. This person has to be equally skilled at all parts of the job, has to be fast, and has to be accurate or you get a bad pizza. Thus, your pizza chef is probably very expensive to hire and retain, and because they have to be good at everything sequentially, it might take some time before your pizza is ready.

Now, think of a mixture of experts like a kitchen staff. There’s a head chef who takes the order, and then routes instructions to different folks on the team. One person gets started with the pizza sauce, another is chopping up toppings, a third is making the dough. They collaborate, get the pizza assembled, and then another person takes it out of the oven and boxes it up.

This model has a couple of key differences that make it preferable for certain tasks. For one thing, you can get more done in the same amount of time because you have multiple people working on component tasks. The person slicing the pepperoni doesn’t also have to toss the dough. The person boxing up the pizza isn’t the person making the sauce.

The second advantage is that not everyone has to be good at everything. The person who folds the pizza boxes and boxes up the pizzas coming out of the oven has to be good at their job, but they don’t have to be good at making sauce or dough – they can just focus on their job.

The third advantage is that not everyone has to be working all at the same time. In our example, the person folding pizza boxes and boxing up pizzas isn’t called onto the line until there’s a pizza ready to go. There’s no point in having that person standing around in the kitchen – we summon them when they have work to do, and otherwise we don’t activate them.

That’s what’s happening inside a mixture of experts model. A model like Mixtral will have component parts and a router. The router is like the head chef, parceling out tokens to different sub-models. For example, there might be a sub-model that’s good at verbs, another that’s good at proper nouns, another that’s good at adjectives, etc. and each gets work as the router sends it their way. The part that handles grammar might not be invoked until later in the process, so there’s some computational efficiency.

Now, there are downsides to the mixture of experts model. They are memory intensive – just like the pizza kitchen, you need a bigger kitchen to accommodate a team of 8 instead of a team of 1, even if that one person is physically robust. And you can get collisions of models and data interference, making the outputs potentially less stable. Again, think of the pizza kitchen – if the kitchen isn’t big enough, you’re going to have people running into each other.

Mixtral’s initial benchmarks place it at or just slightly above OpenAI’s GPT-3.5-Turbo model on general performance; on the Chatbot Arena leaderboard, it ranks above GPT-3.5-Turbo in terms of human reviews. That’s pretty incredible, given that you can run Mixtral on a beefy consumer laptop and you can’t do that with GPT-3.5-Turbo, which requires a room full of servers. And it’s very, very fast – it does inference at roughly the same speed as a 13B model. If you’ve dabbled in open weights models like LLaMa, you know that 13B models are a good balance of speed and coherence. Having a model like Mixtral that gives server-room level quality on a laptop in a timely manner is a big deal. If your MacBook Pro has an M series chip and 64 GB of total RAM, you can run Mixtral comfortably on it, or if you have a Windows machine with an NVIDIA RTX 3090/4090 graphics card, you can also run Mixtral comfortably.

When and how would you use a model like Mixtral? Mixtral’s primary use case is when you need accuracy and speed from a language model. As with many other language models, but especially open weights models, you want to avoid using it as a source of knowledge. It’s best suited for being a translation layer in your process, where it interprets the user’s response, goes to some form of data store like an internal database for answers, gets data from your data store, and then interprets the data back into language. It would be appropriate for use with a chatbot, for example, where speed is important and you want to control hallucination. You’d want to combine it with a system like AutoGen so that there’s a supervisor model running alongside that can reduce hallucinations and wrong answers.

However, that’s Mixtral today. What’s more important about the development of this model is that there’s a great, off-the-shelf mixture of experts LLM that outperforms GPT-3.5-Turbo that you and I can run at home or at work with sufficient consumer hardware. When you consider that Google’s much-publicized Gemini Pro model that was just released for Google Bard underperforms GPT-3.5-Turbo on some benchmarks, having a model like Mixtral available that doesn’t need a room full of servers is incredible. And the architecture that makes up Mixtral is one that other people can modify and train, iterate on, and tune to specific purposes so that it becomes highly fluent in specific tasks. Mixtral ships with the mixture of experts that the model makers thought best; there’s nothing stopping folks in the open weights AI community from changing out individual experts and routing to perform other tasks.

Mixtral is an example of having an office of B+ players working together to outperform what a single A player can do. It’s going to be a big part of the AI landscape for some time to come and the new gold standard for what’s possible in AI that you can run yourself without needing a third party vendor’s systems available at all times. And the mixture of experts technique has performed so well in real-world tests that I would expect it to be the path forward for many different AI models from now on.

Also this past week, I did a lengthy training on implementing compliance with the new EU AI Act, which is likely to become the gold standard for generative AI compliance around the world in the same way GDPR became the standard for data privacy. If you’d like to dig into that and what you need to do to comply, it’s baked into my new Generative AI for Marketers course.

How Was This Issue?

Rate this week’s newsletter issue with a single click. Your feedback over time helps me figure out what content to create for you.

Share With a Friend or Colleague

If you enjoy this newsletter and want to share it with a friend/colleague, please do. Send this URL to your friend/colleague:

https://www.christopherspenn.com/newsletter

For enrolled subscribers on Substack, there are referral rewards if you refer 100, 200, or 300 other readers. Visit the Leaderboard here.

ICYMI: In Case You Missed it

Besides the new Generative AI for Marketers course I’m relentlessly flogging, I recommend

12 Days of Data

As is tradition every year, I start publishing the 12 Days of Data, looking at the data that made the year. Here’s this year’s 12 Days of Data:

Skill Up With Classes

These are just a few of the classes I have available over at the Trust Insights website that you can take.

Premium

Free

Advertisement: Generative AI Workshops & Courses

Imagine a world where your marketing strategies are supercharged by the most cutting-edge technology available – Generative AI. Generative AI has the potential to save you incredible amounts of time and money, and you have the opportunity to be at the forefront. Get up to speed on using generative AI in your business in a thoughtful way with Trust Insights’ new offering, Generative AI for Marketers, which comes in two flavors, workshops and a course.

Workshops: Offer the Generative AI for Marketers half and full day workshops at your company. These hands-on sessions are packed with exercises, resources and practical tips that you can implement immediately.

πŸ‘‰ Click/tap here to book a workshop

Course: We’ve turned our most popular full-day workshop into a self-paced course. The Generative AI for Marketers online course is now available. Use discount code ALMOSTTIMELY for $50 off the course tuition.

πŸ‘‰ Click/tap here to pre-register for the course

If you work at a company or organization that wants to do bulk licensing, let me know!

Get Back to Work

Folks who post jobs in the free Analytics for Marketers Slack community may have those jobs shared here, too. If you’re looking for work, check out these recent open positions, and check out the Slack group for the comprehensive list.

How to Stay in Touch

Let’s make sure we’re connected in the places it suits you best. Here’s where you can find different content:

Advertisement: Ukraine πŸ‡ΊπŸ‡¦ Humanitarian Fund

The war to free Ukraine continues. If you’d like to support humanitarian efforts in Ukraine, the Ukrainian government has set up a special portal, United24, to help make contributing easy. The effort to free Ukraine from Russia’s illegal invasion needs our ongoing support.

πŸ‘‰ Donate today to the Ukraine Humanitarian Relief Fund Β»

Events I’ll Be At

Here’s where I’m speaking and attending. Say hi if you’re at an event also:

  • Tourism Industry Association of Alberta’s Tourism Summit, Edmonton, February 2024
  • Social Media Marketing World, San Diego, February 2024
  • MarketingProfs AI Series, Virtual, March 2024
  • Australian Food and Grocery Council, Melbourne, May 2024
  • MAICON, Cleveland, September 2024

Events marked with a physical location may become virtual if conditions and safety warrant it.

If you’re an event organizer, let me help your event shine. Visit my speaking page for more details.

Can’t be at an event? Stop by my private Slack group instead, Analytics for Marketers.

Required Disclosures

Events with links have purchased sponsorships in this newsletter and as a result, I receive direct financial compensation for promoting them.

Advertisements in this newsletter have paid to be promoted, and as a result, I receive direct financial compensation for promoting them.

My company, Trust Insights, maintains business partnerships with companies including, but not limited to, IBM, Cisco Systems, Amazon, Talkwalker, MarketingProfs, MarketMuse, Agorapulse, Hubspot, Informa, Demandbase, The Marketing AI Institute, and others. While links shared from partners are not explicit endorsements, nor do they directly financially benefit Trust Insights, a commercial relationship exists for which Trust Insights may receive indirect financial benefit, and thus I may receive indirect financial benefit from them as well.

Thank You

Thanks for subscribing and reading this far. I appreciate it. As always, thank you for your support, your attention, and your kindness.

See you next week,

Christopher S. Penn


You might also enjoy:


Want to read more like this from Christopher Penn? Get updates here:

subscribe to my newsletter here


AI for Marketers Book
Get your copy of AI For Marketers

Analytics for Marketers Discussion Group
Join my Analytics for Marketers Slack Group!