Almost Timely News: How To Evaluate a Generative AI System (2024-02-11) :: View in Browser
This week, I recorded two new talks, free for you to enjoy:
- 👉 Predictive Analytics and Generative AI for Travel, Tourism, and Hospitality
- 👉 Building the Data-Driven, AI-Powered Customer Journey for Retail and Ecommerce
Content Authenticity Statement
90% of this week’s newsletter was generated by me, the human. A good portion of the demo video shows generative AI results. Learn why this kind of disclosure is now legally required for anyone doing business in any capacity with the EU.
Watch This Newsletter On YouTube 📺
What’s On My Mind: How To Evaluate a Generative AI System
I strongly encourage you to watch the video edition of this week’s newsletter to see the actual results of generative AI.
This week, Google rebranded and relaunched its Bard service as Gemini, while OpenAI was making noises about GPT-5. Stable Diffusion’s Stable Video got a nice buff in terms of video quality output, and Mistral Medium has been climbing the charts over at LMSys’ Chatbot Arena. With all these models, it might be tempting to throw your hands up, pick one with a throw of a dart, and call it a day. So today, let’s talk about HOW to evaluate models to decide which is best for you.
Unsurprisingly, we’ll be using the Trust Insights 5P Framework to do this evaluation. Here are the five parts:
- Purpose: what task do you want the model to do?
- People: who will be using the model?
- Process: what prompt will you be evaluating the model with?
- Platform: which models are you evaluating?
- Performance: score the models based on their output.
This is a very straightforward testing framework, but it helps codify and simplify the testing process – especially when you have a lot to test, or you have specific needs for testing.
So let’s dig in.
What task do you want the model to perform? Ideally this is a task suited to the model type you’re working with. For example, suppose you want to render an image as digital art. You have a gigantic bounty of open weights models on sites like Hugging Face and CivitAI as well as commercial SaaS tools like DALL-E, Microsoft Bing Image Creator, Photoshop, and Google Gemini.
The first step is to clearly define the task. What do you want to do? What are the criteria for success? For example, if you’re rendering an image of, say, a middle-age Korean woman CEO, at the very least the image should look like that kind of person – correct number of fingers, not blatantly a racist caricature, etc. If you’re evaluating a model to pick the winner of the Super Bowl, you’d want a clear, definitive answer, probably along with some reasoning about the model’s choices.
If you’re evaluating models for work, who else besides you will be using the model? What skills does that person need? Will they need to revise and upgrade your prompt? Understanding who the people are that will benefit from your selected model is critical – a model, for example, that requires relatively technical setup is probably going to be a non-starter for non-technical people. A good example of this is setting up ComfyUI with Stable Diffusion. For a technically skilled person, setting up this environment is trivial. For a non-technical person, asking them to clone a Git repo and run local Python code may as well be asking them to interpret ancient Sumerian.
We have to know our people to know what processes and platforms are even on the table.
For model comparison, we want a standardized prompt that follows basic best practices and is relatively portable across systems. After all, if there’s an important task you want to accomplish with a generative AI model, you want that task to work well on your platform of choice. Thus, you want to invest a lot of time up front in thinking through what that prompt should look like.
For example, with my Super Bowl prediction prompt, I copied all the post-season data from the NFL public website for offense, defense, and special teams for the two teams playing, plus set up a tree-of-thought prompt to walk through the data and really invest time in digging through it.
Generally speaking, big public models like Gemini, GPT-4, and Claude 2.1 can all more or less interchangeably parse the same prompt in very similar ways. They have enough data in them that you probably won’t get wildly different results. Some systems, like Bing and Gemini, will also augment what the models knows with data from other platforms, so clarifying whether a task relies on external data is important. Again, with my Super Bowl prompt, Bing and Gemini both pulled in player data as well as the team data I supplied, giving more robust answers than ChatGPT did.
Choosing a model depends on the task you’re trying to accomplish. If you’re doing language tasks, choose a language model and system like ChatGPT. If you’re doing image tasks, choose an image or multimodal model like DALL-E or Stable Diffusion. If you’re not sure, start with a multimodal model – Gemini and paid ChatGPT are good places to start.
How do you know what kind of model to pick? It’s based on your Purpose, which is why we start with purpose. Clearly defining what we want makes it easier to evaluate a model.
Finally, we get to the evaluation itself. Generally speaking, you want a combination of qualitative and quantitative evaluation. For tasks with clear success parameters – like extracting data from text into a table, for example – you want to have numeric scores. I use a 3 point system – 0 points if a model fails, 1 point if it minimally succeeds but quality of response is low, and 2 points if it fully succeeds. Again, for something like tabular data, if a model produces word salad and not a table, that would be a 0. If it makes a table but the table is clearly wrong, that’s a 1. And if it succeeds in processing the data correctly, that’s a 2.
So let’s step through an example to see how this might play out. I was talking with my friend Ashley Zeckman, CEO of Onalytica, the other day about thought leadership in the context of publishing content on LinkedIn. In that discussion, we realized that there were some very divergent points of view about what thought leadership even was. So let’s make a tree of thought prompt about the topic to see if we can arrive at a fresh, original perspective.
First, the purpose is clear. I’ll use a user story to define it. As a content creator, I need to determine which language model is capable of generating the most unique insights on a topic using tree of thought prompting so that I can have generative AI create better, more original content.
That’s a pretty clear user story. The people – well, that’s me. Let’s take a look at the process.
Here’s the prompt I’ll use:
Today, we’re going to simulate an academic debate between two points of view, along with a debate moderator. The topic of the debate is thought leadership in the context of marketing, personal brand, and social media. Our two debate contestants are:
– Ashley Awesome: Ashley Awesome is a personal branding expert who coaches executives on thought leadership and building a personal brand, especially on platforms like LinkedIn, YouTube, and Medium. Ashley wholeheartedly believes in the power of personal branding and thought leadership, and thinks thought leadership should be a core strategic pillar of any executive and company. Ashley’s tone is typically optimistic, but she can become frustrated when dealing with someone displaying willful ignorance or condescension.
– Christopher Contrary: Christopher Contrary is a marketing expert who is a non-believer in personal branding and thought leadership. Christopher thinks thought leadership is thinly disguised narcissism and promotional sales content, and so-called “thought leaders” on many platforms are recycling obvious points of view or taking needlessly provocative stances on settled issues to generate vapid attention. Christopher’s tone is confrontational and contrary, and can become brusque when repeatedly challenged.
The debate will be moderated by Betty Balanced. Betty is a cool-headed moderator with extensive experience in moderating controversial topics in high-stakes debates like presidential forums.
Structure the debate as a question from the moderator, followed by responses from each contestant. Each contestant may reply once in rebuttal before Betty moves onto the next debate question.
This is the format the debate should take:
BETTY: Good afternoon, ladies and gentlemen. Welcome to the World Leadership Forum. I’m your moderator, Betty Balanced. Today we will be hearing from our contestants, Ashley Awesome and Christopher Contrary, on the topic of thought leadership. Welcome, Ashley and Christopher.
ASHLEY: It’s a pleasure to be here.
CHRISTOPHER: Thank you for having me.
BETTY: With introductions out of the way, let’s begin with our first debate point. What is, from your point of view, thought leadership?
After a question has been answered and rebutted, wait for feedback from me, the user.
Begin the debate by having Betty ask the contestants to each define thought leadership.
In terms of platform, I want to evaluate Claude 2.1 in the Anthropic interface, GPT-4-Turbo in the OpenAI Playground, and Gemini in the Google Gemini interface.
Watch the video for this issue of the newsletter to see how GPT-4-Turbo, Claude 2.1, and Gemini handle this complex prompt.
You’d follow this process for any generative AI system. If you wanted to evaluate an image, you’d follow the 5Ps to set your purpose, determine the people involved, build a complex, robust prompt, choose the models and systems you want, and then evaluate the results. The reason you should do this is so that you evaluate generative AI for your specific needs. There are a lot of benchmarks and comparisons that people publish about all these different models, but most of the time, those benchmarks don’t reflect your specific needs. By following this framework, you will find the best fit for the generative AI model that meets your specific use cases – and it may not be the same model and software that others say is the best. Best is often personal.
How Was This Issue?
Rate this week’s newsletter issue with a single click. Your feedback over time helps me figure out what content to create for you.
Share With a Friend or Colleague
If you enjoy this newsletter and want to share it with a friend/colleague, please do. Send this URL to your friend/colleague:
For enrolled subscribers on Substack, there are referral rewards if you refer 100, 200, or 300 other readers. Visit the Leaderboard here.
ICYMI: In Case You Missed it
Besides the new Generative AI for Marketers course I’m relentlessly flogging, I recommend the podcast I did with Katie this week on data privacy and generative AI.
- In-Ear Insights: Data Privacy and Generative AI
- Almost Timely News, February 4, 2024: What AI Has Made Scarce
- Now with More Mariachi Button!
- So What? Generative AI and Photoshop
- How To Make Google Analytics 4 Work For You
- INBOX INSIGHTS, February 7, 2024: Building Requirements with AI, GA4 Diagnostics
- RED TEAMING CUSTOM GPTs, PART 3
- Disclosure of AI and Protection of Copyright
Skill Up With Classes
These are just a few of the classes I have available over at the Trust Insights website that you can take.
- 🦾 Generative AI for Marketers
- 👉 Google Analytics 4 for Marketers
- 👉 Google Search Console for Marketers (🚨 just updated with AI SEO stuff! 🚨)
- Predictive Analytics and Generative AI for Travel, Tourism, and Hospitality, 2024 Edition
- Building the Data-Driven, AI-Powered Customer Journey for Retail and Ecommerce, 2024 Edition
- The Marketing Singularity: How Generative AI Means the End of Marketing As We Knew It
- Powering Up Your LinkedIn Profile (For Job Hunters) 2023 Edition
- Measurement Strategies for Agencies
- Empower Your Marketing With Private Social Media Communities
- Exploratory Data Analysis: The Missing Ingredient for AI
- How to Prove Social Media ROI
- Proving Social Media ROI
- Paradise by the Analytics Dashboard Light: How to Create Impactful Dashboards and Reports
Advertisement: Generative AI Workshops & Courses
Imagine a world where your marketing strategies are supercharged by the most cutting-edge technology available – Generative AI. Generative AI has the potential to save you incredible amounts of time and money, and you have the opportunity to be at the forefront. Get up to speed on using generative AI in your business in a thoughtful way with Trust Insights’ new offering, Generative AI for Marketers, which comes in two flavors, workshops and a course.
Workshops: Offer the Generative AI for Marketers half and full day workshops at your company. These hands-on sessions are packed with exercises, resources and practical tips that you can implement immediately.
Course: We’ve turned our most popular full-day workshop into a self-paced course. The Generative AI for Marketers online course is now available and just updated this week! Use discount code ALMOSTTIMELY for $50 off the course tuition.
If you work at a company or organization that wants to do bulk licensing, let me know!
Get Back to Work
Folks who post jobs in the free Analytics for Marketers Slack community may have those jobs shared here, too. If you’re looking for work, check out these recent open positions, and check out the Slack group for the comprehensive list.
- Content Strategist, Financial Services at Content Strategist, Financial Services
- Digital Analyst – Phase2 Technology at Digital Analyst
- Seer Interactive – Strategy & Analytics Consulting Manager at Seer Interactive
- Senior Web Analytics Consultant – Artefact at Senior Web Analytics Consultant
- Sr. Director, Digital Product Intelligence In Multiple Locations at Marriott International
- Technical Content Marketing Writer – Generative Ai Specialist at Technical content marketing writer
- Vacancies at UNICEF Careers
What I’m Reading: Your Stuff
Let’s look at the most interesting content from around the web on topics you care about, some of which you might have even written.
Social Media Marketing
- The Easiest Way to Shorten a Video for Social Media via The TechSmith Blog
- Social Media Benchmarks by Industry in 2024 via Sprout Social
- Instagram Analytics: 2024 Guide to Smarter Results Tracking via Social Media Marketing & Management Dashboard
Media and Content
- Why Cant My PR Agency Guarantee Coverage? via Firecracker PR
- Outsourcing Content Creation: A 5-Step Vetting Process
- Mind the Gap: How to Leverage Resume Gaps for Future Success
SEO, Google, and Paid Media
- SEO for Law Firms: Attract More Clients & Grow Your Practice
- 23 Great Search Engines You Can Use Instead Of Google
- SEO Content Strategy: From Basic to Advanced
Advertisement: Business Cameos
If you’re familiar with the Cameo system – where people hire well-known folks for short video clips – then you’ll totally get Thinkers One. Created by my friend Mitch Joel, Thinkers One lets you connect with the biggest thinkers for short videos on topics you care about. I’ve got a whole slew of Thinkers One Cameo-style topics for video clips you can use at internal company meetings, events, or even just for yourself. Want me to tell your boss that you need to be paying attention to generative AI right now?
Tools, Machine Learning, and AI
- How to keep your art out of AI generators via The Verge
- Generate Code, Answer Queries, and Translate Text with New NVIDIA AI Foundation Models via NVIDIA Technical Blog
- How to Create Your Own AI Influencer?
Analytics, Stats, and Data Science
- Math and Data Science: What Do You Need to Know?
- What Data Science Can Do for Site Architectures
- Instagram Analytics: 2024 Guide to Smarter Results Tracking via Social Media Marketing & Management Dashboard
All Things IBM
- The most important AI trends in 2024 via IBM Blog
- The history of climate change via IBM Blog
- Preparing for the EU AI Act: Getting governance right via IBM Blog
Dealer’s Choice : Random Stuff
- As Iran-backed groups attack Red Sea ships, investors are backing startups assisting global cargo via TechCrunch
- DraftKings, FanDuel Will Be Super Bowl Winners Amid Taylor Swift Effect
- Top 10 AI Games Shaping the Future of Gaming Industry via Analytics Vidhya
How to Stay in Touch
Let’s make sure we’re connected in the places it suits you best. Here’s where you can find different content:
- My blog – daily videos, blog posts, and podcast episodes
- My YouTube channel – daily videos, conference talks, and all things video
- My company, Trust Insights – marketing analytics help
- My podcast, Marketing over Coffee – weekly episodes of what’s worth noting in marketing
- My second podcast, In-Ear Insights – the Trust Insights weekly podcast focused on data and analytics
- On Threads – random personal stuff and chaos
- On LinkedIn – daily videos and news
- On Instagram – personal photos and travels
- My free Slack discussion forum, Analytics for Marketers – open conversations about marketing and analytics
Advertisement: Ukraine 🇺🇦 Humanitarian Fund
The war to free Ukraine continues. If you’d like to support humanitarian efforts in Ukraine, the Ukrainian government has set up a special portal, United24, to help make contributing easy. The effort to free Ukraine from Russia’s illegal invasion needs our ongoing support.
Events I’ll Be At
Here’s where I’m speaking and attending. Say hi if you’re at an event also:
- Social Media Marketing World, San Diego, February 2024
- MarketingProfs AI Series, Virtual, March 2024
- Society for Marketing Professional Services, Boston, April 2024
- Society for Marketing Professional Services, Los Angeles, May 2024
- Australian Food and Grocery Council, Melbourne, May 2024
- MAICON, Cleveland, September 2024
Events marked with a physical location may become virtual if conditions and safety warrant it.
If you’re an event organizer, let me help your event shine. Visit my speaking page for more details.
Can’t be at an event? Stop by my private Slack group instead, Analytics for Marketers.
Events with links have purchased sponsorships in this newsletter and as a result, I receive direct financial compensation for promoting them.
Advertisements in this newsletter have paid to be promoted, and as a result, I receive direct financial compensation for promoting them.
My company, Trust Insights, maintains business partnerships with companies including, but not limited to, IBM, Cisco Systems, Amazon, Talkwalker, MarketingProfs, MarketMuse, Agorapulse, Hubspot, Informa, Demandbase, The Marketing AI Institute, and others. While links shared from partners are not explicit endorsements, nor do they directly financially benefit Trust Insights, a commercial relationship exists for which Trust Insights may receive indirect financial benefit, and thus I may receive indirect financial benefit from them as well.
Thanks for subscribing and reading this far. I appreciate it. As always, thank you for your support, your attention, and your kindness.
See you next week,
Christopher S. Penn
You might also enjoy:
- Vision, Mission, Strategy, Tactics, and Execution
- Almost Timely News, December 10, 2023: Where Generative AI and Language Models are Probably Going in 2024
- Almost Timely News, December 31, 2023: Three Words and Four Enemies of 2024
- What's the Difference Between Social Media and New Media?
- How To Determine Whether Something is a Trend
Want to read more like this from Christopher Penn? Get updates here:
Get your copy of AI For Marketers