Almost Timely News, February 11, 2024: How To Evaluate a Generative AI System

Almost Timely News: How To Evaluate a Generative AI System (2024-02-11) :: View in Browser

Almost Timely News

This week, I recorded two new talks, free for you to enjoy:

Content Authenticity Statement

90% of this week’s newsletter was generated by me, the human. A good portion of the demo video shows generative AI results. Learn why this kind of disclosure is now legally required for anyone doing business in any capacity with the EU.

Watch This Newsletter On YouTube πŸ“Ί

Almost Timely News: How To Evaluate a Generative AI System (2024-02-11)

Click here for the video πŸ“Ί version of this newsletter on YouTube Β»

Click here for an MP3 audio 🎧 only version »

What’s On My Mind: How To Evaluate a Generative AI System

I strongly encourage you to watch the video edition of this week’s newsletter to see the actual results of generative AI.

This week, Google rebranded and relaunched its Bard service as Gemini, while OpenAI was making noises about GPT-5. Stable Diffusion’s Stable Video got a nice buff in terms of video quality output, and Mistral Medium has been climbing the charts over at LMSys’ Chatbot Arena. With all these models, it might be tempting to throw your hands up, pick one with a throw of a dart, and call it a day. So today, let’s talk about HOW to evaluate models to decide which is best for you.

Unsurprisingly, we’ll be using the Trust Insights 5P Framework to do this evaluation. Here are the five parts:

  • Purpose: what task do you want the model to do?
  • People: who will be using the model?
  • Process: what prompt will you be evaluating the model with?
  • Platform: which models are you evaluating?
  • Performance: score the models based on their output.

This is a very straightforward testing framework, but it helps codify and simplify the testing process – especially when you have a lot to test, or you have specific needs for testing.

So let’s dig in.


What task do you want the model to perform? Ideally this is a task suited to the model type you’re working with. For example, suppose you want to render an image as digital art. You have a gigantic bounty of open weights models on sites like Hugging Face and CivitAI as well as commercial SaaS tools like DALL-E, Microsoft Bing Image Creator, Photoshop, and Google Gemini.

The first step is to clearly define the task. What do you want to do? What are the criteria for success? For example, if you’re rendering an image of, say, a middle-age Korean woman CEO, at the very least the image should look like that kind of person – correct number of fingers, not blatantly a racist caricature, etc. If you’re evaluating a model to pick the winner of the Super Bowl, you’d want a clear, definitive answer, probably along with some reasoning about the model’s choices.


If you’re evaluating models for work, who else besides you will be using the model? What skills does that person need? Will they need to revise and upgrade your prompt? Understanding who the people are that will benefit from your selected model is critical – a model, for example, that requires relatively technical setup is probably going to be a non-starter for non-technical people. A good example of this is setting up ComfyUI with Stable Diffusion. For a technically skilled person, setting up this environment is trivial. For a non-technical person, asking them to clone a Git repo and run local Python code may as well be asking them to interpret ancient Sumerian.

We have to know our people to know what processes and platforms are even on the table.


For model comparison, we want a standardized prompt that follows basic best practices and is relatively portable across systems. After all, if there’s an important task you want to accomplish with a generative AI model, you want that task to work well on your platform of choice. Thus, you want to invest a lot of time up front in thinking through what that prompt should look like.

For example, with my Super Bowl prediction prompt, I copied all the post-season data from the NFL public website for offense, defense, and special teams for the two teams playing, plus set up a tree-of-thought prompt to walk through the data and really invest time in digging through it.

Generally speaking, big public models like Gemini, GPT-4, and Claude 2.1 can all more or less interchangeably parse the same prompt in very similar ways. They have enough data in them that you probably won’t get wildly different results. Some systems, like Bing and Gemini, will also augment what the models knows with data from other platforms, so clarifying whether a task relies on external data is important. Again, with my Super Bowl prompt, Bing and Gemini both pulled in player data as well as the team data I supplied, giving more robust answers than ChatGPT did.


Choosing a model depends on the task you’re trying to accomplish. If you’re doing language tasks, choose a language model and system like ChatGPT. If you’re doing image tasks, choose an image or multimodal model like DALL-E or Stable Diffusion. If you’re not sure, start with a multimodal model – Gemini and paid ChatGPT are good places to start.

How do you know what kind of model to pick? It’s based on your Purpose, which is why we start with purpose. Clearly defining what we want makes it easier to evaluate a model.


Finally, we get to the evaluation itself. Generally speaking, you want a combination of qualitative and quantitative evaluation. For tasks with clear success parameters – like extracting data from text into a table, for example – you want to have numeric scores. I use a 3 point system – 0 points if a model fails, 1 point if it minimally succeeds but quality of response is low, and 2 points if it fully succeeds. Again, for something like tabular data, if a model produces word salad and not a table, that would be a 0. If it makes a table but the table is clearly wrong, that’s a 1. And if it succeeds in processing the data correctly, that’s a 2.

So let’s step through an example to see how this might play out. I was talking with my friend Ashley Zeckman, CEO of Onalytica, the other day about thought leadership in the context of publishing content on LinkedIn. In that discussion, we realized that there were some very divergent points of view about what thought leadership even was. So let’s make a tree of thought prompt about the topic to see if we can arrive at a fresh, original perspective.

First, the purpose is clear. I’ll use a user story to define it. As a content creator, I need to determine which language model is capable of generating the most unique insights on a topic using tree of thought prompting so that I can have generative AI create better, more original content.

That’s a pretty clear user story. The people – well, that’s me. Let’s take a look at the process.

Here’s the prompt I’ll use:

Today, we’re going to simulate an academic debate between two points of view, along with a debate moderator. The topic of the debate is thought leadership in the context of marketing, personal brand, and social media. Our two debate contestants are:

– Ashley Awesome: Ashley Awesome is a personal branding expert who coaches executives on thought leadership and building a personal brand, especially on platforms like LinkedIn, YouTube, and Medium. Ashley wholeheartedly believes in the power of personal branding and thought leadership, and thinks thought leadership should be a core strategic pillar of any executive and company. Ashley’s tone is typically optimistic, but she can become frustrated when dealing with someone displaying willful ignorance or condescension.
– Christopher Contrary: Christopher Contrary is a marketing expert who is a non-believer in personal branding and thought leadership. Christopher thinks thought leadership is thinly disguised narcissism and promotional sales content, and so-called “thought leaders” on many platforms are recycling obvious points of view or taking needlessly provocative stances on settled issues to generate vapid attention. Christopher’s tone is confrontational and contrary, and can become brusque when repeatedly challenged.

The debate will be moderated by Betty Balanced. Betty is a cool-headed moderator with extensive experience in moderating controversial topics in high-stakes debates like presidential forums.

Structure the debate as a question from the moderator, followed by responses from each contestant. Each contestant may reply once in rebuttal before Betty moves onto the next debate question.

This is the format the debate should take:

BETTY: Good afternoon, ladies and gentlemen. Welcome to the World Leadership Forum. I’m your moderator, Betty Balanced. Today we will be hearing from our contestants, Ashley Awesome and Christopher Contrary, on the topic of thought leadership. Welcome, Ashley and Christopher.

ASHLEY: It’s a pleasure to be here.

CHRISTOPHER: Thank you for having me.

BETTY: With introductions out of the way, let’s begin with our first debate point. What is, from your point of view, thought leadership?

After a question has been answered and rebutted, wait for feedback from me, the user.

Begin the debate by having Betty ask the contestants to each define thought leadership.

In terms of platform, I want to evaluate Claude 2.1 in the Anthropic interface, GPT-4-Turbo in the OpenAI Playground, and Gemini in the Google Gemini interface.

Watch the video for this issue of the newsletter to see how GPT-4-Turbo, Claude 2.1, and Gemini handle this complex prompt.

You’d follow this process for any generative AI system. If you wanted to evaluate an image, you’d follow the 5Ps to set your purpose, determine the people involved, build a complex, robust prompt, choose the models and systems you want, and then evaluate the results. The reason you should do this is so that you evaluate generative AI for your specific needs. There are a lot of benchmarks and comparisons that people publish about all these different models, but most of the time, those benchmarks don’t reflect your specific needs. By following this framework, you will find the best fit for the generative AI model that meets your specific use cases – and it may not be the same model and software that others say is the best. Best is often personal.

How Was This Issue?

Rate this week’s newsletter issue with a single click. Your feedback over time helps me figure out what content to create for you.

Share With a Friend or Colleague

If you enjoy this newsletter and want to share it with a friend/colleague, please do. Send this URL to your friend/colleague:

For enrolled subscribers on Substack, there are referral rewards if you refer 100, 200, or 300 other readers. Visit the Leaderboard here.

ICYMI: In Case You Missed it

Besides the new Generative AI for Marketers course I’m relentlessly flogging, I recommend the podcast I did with Katie this week on data privacy and generative AI.

Skill Up With Classes

These are just a few of the classes I have available over at the Trust Insights website that you can take.



Advertisement: Generative AI Workshops & Courses

Imagine a world where your marketing strategies are supercharged by the most cutting-edge technology available – Generative AI. Generative AI has the potential to save you incredible amounts of time and money, and you have the opportunity to be at the forefront. Get up to speed on using generative AI in your business in a thoughtful way with Trust Insights’ new offering, Generative AI for Marketers, which comes in two flavors, workshops and a course.

Workshops: Offer the Generative AI for Marketers half and full day workshops at your company. These hands-on sessions are packed with exercises, resources and practical tips that you can implement immediately.

πŸ‘‰ Click/tap here to book a workshop

Course: We’ve turned our most popular full-day workshop into a self-paced course. The Generative AI for Marketers online course is now available and just updated this week! Use discount code ALMOSTTIMELY for $50 off the course tuition.

πŸ‘‰ Click/tap here to pre-register for the course

If you work at a company or organization that wants to do bulk licensing, let me know!

Get Back to Work

Folks who post jobs in the free Analytics for Marketers Slack community may have those jobs shared here, too. If you’re looking for work, check out these recent open positions, and check out the Slack group for the comprehensive list.

What I’m Reading: Your Stuff

Let’s look at the most interesting content from around the web on topics you care about, some of which you might have even written.

Social Media Marketing

Media and Content

SEO, Google, and Paid Media

Advertisement: Business Cameos

If you’re familiar with the Cameo system – where people hire well-known folks for short video clips – then you’ll totally get Thinkers One. Created by my friend Mitch Joel, Thinkers One lets you connect with the biggest thinkers for short videos on topics you care about. I’ve got a whole slew of Thinkers One Cameo-style topics for video clips you can use at internal company meetings, events, or even just for yourself. Want me to tell your boss that you need to be paying attention to generative AI right now?

πŸ“Ί Pop on by my Thinkers One page today and grab a video now.

Tools, Machine Learning, and AI

Analytics, Stats, and Data Science

All Things IBM

Dealer’s Choice : Random Stuff

How to Stay in Touch

Let’s make sure we’re connected in the places it suits you best. Here’s where you can find different content:

Advertisement: Ukraine πŸ‡ΊπŸ‡¦ Humanitarian Fund

The war to free Ukraine continues. If you’d like to support humanitarian efforts in Ukraine, the Ukrainian government has set up a special portal, United24, to help make contributing easy. The effort to free Ukraine from Russia’s illegal invasion needs our ongoing support.

πŸ‘‰ Donate today to the Ukraine Humanitarian Relief Fund Β»

Events I’ll Be At

Here’s where I’m speaking and attending. Say hi if you’re at an event also:

  • Social Media Marketing World, San Diego, February 2024
  • MarketingProfs AI Series, Virtual, March 2024
  • Society for Marketing Professional Services, Boston, April 2024
  • Society for Marketing Professional Services, Los Angeles, May 2024
  • Australian Food and Grocery Council, Melbourne, May 2024
  • MAICON, Cleveland, September 2024

Events marked with a physical location may become virtual if conditions and safety warrant it.

If you’re an event organizer, let me help your event shine. Visit my speaking page for more details.

Can’t be at an event? Stop by my private Slack group instead, Analytics for Marketers.

Required Disclosures

Events with links have purchased sponsorships in this newsletter and as a result, I receive direct financial compensation for promoting them.

Advertisements in this newsletter have paid to be promoted, and as a result, I receive direct financial compensation for promoting them.

My company, Trust Insights, maintains business partnerships with companies including, but not limited to, IBM, Cisco Systems, Amazon, Talkwalker, MarketingProfs, MarketMuse, Agorapulse, Hubspot, Informa, Demandbase, The Marketing AI Institute, and others. While links shared from partners are not explicit endorsements, nor do they directly financially benefit Trust Insights, a commercial relationship exists for which Trust Insights may receive indirect financial benefit, and thus I may receive indirect financial benefit from them as well.

Thank You

Thanks for subscribing and reading this far. I appreciate it. As always, thank you for your support, your attention, and your kindness.

See you next week,

Christopher S. Penn

You might also enjoy:

Want to read more like this from Christopher Penn? Get updates here:

subscribe to my newsletter here

AI for Marketers Book
Take my Generative AI for Marketers course!

Analytics for Marketers Discussion Group
Join my Analytics for Marketers Slack Group!


One response to “Almost Timely News, February 11, 2024: How To Evaluate a Generative AI System”

Leave a Reply

Your email address will not be published. Required fields are marked *

Pin It on Pinterest

Share This