Almost Timely News, April 28, 2024: 🗞️ Building a Synthetic Dataset with Generative AI

👉 Did you miss my webinar this past week on generative AI for agencies? Go catch the free replay!

Content Authenticity Statement

100% of this week’s newsletter was generated by me, the human. Learn why this kind of disclosure is a good idea and might be required for anyone doing business in any capacity with the EU in the near future.

What’s On My Mind: Building a Synthetic Dataset with Generative AI

Jesper asked on YouTube this past week if I’d ever done a tutorial or walkthrough of building a synthetic dataset using generative AI. I’ve covered this lightly in the past, but mostly in passing. First, let’s talk about why you would even want to do such a thing.

Synthetic datasets have a bunch of different uses. If you’re working with incredibly sensitive data but you need to collaborate with others, you might want to generate a dataset that has the characteristics of your data but none of the valuable information. For example, you might be working with user data, or healthcare data, or very specific financial data – all datasets that are highly regulated and protected, for good reason. You can’t share that information with unauthorized people.

Another reason for synthetic datasets is to supplement existing data. Everyone and their cousin is all in on generative AI, but once you start talking about tuning models and customizing them, it becomes blatantly obvious most organizations just don’t have enough data to get statistically meaningful results from the process. Synthetic data, patterned on your existing data, can boost the amount of data you have available to use.

A third reason is regulatory requirements. Under legislation like GDPR, if you collected data for one purpose, you can’t go using it for another purpose. If you collected emails and email marketing engagement data for email marketing purposes, you’re aligned with what the user gave consent for. Using that data for generative AI? Nope. That’s not permitted under GDPR. You would have to go back to all your users and ask permission for that. But if you created a synthetic dataset that mimicked your existing data but had none of the actual data in it, you’re good to go.

Your reasons for using synthetic data will largely dictate how you go about generating it. For just not having enough data, generating more of the same kind of data is a very straightforward task. For having data you can’t share due to privacy and sensitivity, you have to go through some statistical processes first. And for adhering to regulatory requirements, that’s probably the most tricky use case of all.

So with that backdrop, let’s go ahead and look at the process of creating synthetic data. We’ll start with the easiest use case first, just making more stuff. Let’s say you have a dataset and you just need more of it. The first question you have to ask is whether there are patterns in the data that you need to replicate, or you just need more of the stuff in general.

For example, suppose you wanted a large dataset of Instagram captions, perhaps to fine-tune a large language model on social media sentiment. You could take an existing dataset and hand it to a model like Google Gemini and simply ask it to generate more data that resembles the existing dataset. You’d not include any of the quantitative data, just the unstructured text, and tell it to make more of it matching the patterns, vocabulary, and writing style of the original dataset.
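
Here’s a minimal sketch of what assembling that request might look like in Python. The function name `build_caption_prompt` and the prompt wording are my own assumptions for illustration, not a required format; the resulting string is what you’d paste or send to whichever model you use.

```python
def build_caption_prompt(examples, n_new=20):
    """Assemble a prompt asking a model for more captions in the same style.

    Hypothetical helper: the name and the prompt wording are illustrative only.
    """
    # Drop blank entries and exact duplicates while keeping the original order
    seen = set()
    cleaned = []
    for text in examples:
        text = text.strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)

    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(cleaned, start=1))
    return (
        f"Here are {len(cleaned)} example Instagram captions:\n\n"
        f"{numbered}\n\n"
        f"Write {n_new} new captions that match the patterns, vocabulary, "
        "and writing style of the examples. Do not copy any example verbatim. "
        "Return one caption per line."
    )
```

Note that only the unstructured text goes in; likes, comments, and other numbers are left out on purpose.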

However, if you’re wanting to create a dataset for use with classification, you’d probably want data that has defined categories, like captions for Reels, photos, and albums. In that case, you’d want to specify to the language model what example data you have for each category, then have it generate more within each category. For the best performance, you’d separate out the original datasets into those categories to begin with, and then ask for the same kind of generation.
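
As a rough sketch of that separation step, assuming your data is a list of (category, caption) pairs: the helper below splits the captions by category and builds one request per category. The name `category_prompts` and the prompt text are hypothetical.

```python
from collections import defaultdict


def category_prompts(rows, n_new=10):
    """Build one generation prompt per category from (category, caption) pairs.

    Hypothetical helper: the name and the prompt wording are illustrative only.
    """
    # Separate the original dataset into its categories first
    groups = defaultdict(list)
    for category, caption in rows:
        groups[category].append(caption)

    # Then ask for more of the same, one category at a time
    prompts = {}
    for category, captions in groups.items():
        examples = "\n".join(f"- {caption}" for caption in captions)
        prompts[category] = (
            f"These are Instagram captions for the category '{category}':\n"
            f"{examples}\n\n"
            f"Write {n_new} new captions for the same category, matching "
            "the style of the examples. Return one caption per line."
        )
    return prompts
```

Each prompt is sent as its own task, so the model only ever sees one category at a time.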

You’ll note that so far, we’re excluding the quantitative data. The reason for that is focus; language models can interpret numerical data, but as with all generative AI tasks, the more focused your inquiries are, the better the models tend to perform. If you don’t need quantitative data in your synthetic dataset, don’t include it.

Suppose quantitative data did matter. What would you do then? As you did with the classification dataset, you’d want to bin your quantitative data and then generate more of it by bin as a discrete task. For example, your starting dataset might be binned into quartiles (25% increments); you’d provide each quartile to the model and ask it to synthesize that content plus the quantitative data within a specific range, the range of the bin.
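
A minimal sketch of the quartile approach, using only Python’s standard library. I’m assuming each row is a dictionary with a `caption` field plus one numeric field; the function names and the prompt wording are hypothetical.

```python
import bisect
import statistics


def quartile_bins(rows, key):
    """Split rows into four bins using the quartile cut points of rows[key]."""
    values = [row[key] for row in rows]
    cuts = statistics.quantiles(values, n=4)  # three cut points
    bins = [[], [], [], []]
    for row in rows:
        # bisect_left finds which quartile this value falls into (0 to 3)
        bins[bisect.bisect_left(cuts, row[key])].append(row)
    return bins


def bin_prompt(bin_rows, key, n_new=10):
    """Ask for more rows whose numeric value stays inside this bin's range.

    Assumes bin_rows is not empty; with many tied values a bin can end up empty.
    """
    low = min(row[key] for row in bin_rows)
    high = max(row[key] for row in bin_rows)
    examples = "\n".join(f"{row['caption']} | {key}={row[key]}" for row in bin_rows)
    return (
        f"Examples (caption | {key}):\n{examples}\n\n"
        f"Write {n_new} new rows in the same format, with {key} "
        f"between {low} and {high}."
    )
```

You’d call `bin_prompt` once per bin and run each result as a separate generation task.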

Why not have it do everything all at once? Specificity. The more you can break down a task and make it granular, the better the models will perform.

So that’s the first use case and a half, making more stuff from the stuff you have. It’s the foundation technique, and you’ll find that today’s very large models are capable of doing it quite well. The more example data you can provide, the better the models will perform. Giving them 10 examples will generate okay results. Giving them 100 examples will be better, and 1,000 examples even better than that.

Let’s dig into the second use case, working with data that requires cleaning to remove protected attributes, like personally identifying information. Personally identifying information (PII) – like email addresses – is not something you want to be handing out, especially if you want to hand the data itself to someone else to work with. So how would you use generative AI to work with this data?

First, using traditional data management techniques, replace all the existing PII with unique identifiers. There are any number of software libraries and packages capable of doing this; you can even have generative AI write you a script in a language like Python or R to perform this task. You can even have it replace named entities (names of people, places, and things) within unstructured text to further obscure personal information.
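
To make that concrete, here’s a small sketch of the kind of script I mean, covering email addresses only. The regular expression is a simplified pattern that won’t catch every valid address, and the `USER_0001` identifier format and the function name are my own assumptions. It doesn’t attempt named entity replacement, which would need a dedicated library.

```python
import re

# Simplified email pattern: good enough for a sketch, not a full validator
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize_emails(texts):
    """Replace each distinct email address with a stable placeholder ID.

    Returns the cleaned texts plus the email-to-ID mapping. The mapping is
    itself sensitive, so keep it secured or discard it.
    """
    mapping = {}

    def replace(match):
        email = match.group(0).lower()  # treat case variants as one address
        if email not in mapping:
            mapping[email] = f"USER_{len(mapping) + 1:04d}"
        return mapping[email]

    cleaned = [EMAIL_RE.sub(replace, text) for text in texts]
    return cleaned, mapping
```

The same address always maps to the same identifier, so patterns like repeat contacts survive even though the address itself is gone.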

Once you’ve done this task of sanitizing the source data, you can then hand it to generative AI and have it replicate more of it, following the foundational techniques we discussed in the first section. Here’s the critical difference; once you’ve generated a new dataset that’s based on the original (perhaps with binning and quantitative data) you want to REMOVE the original dataset. That way, the data you hand to another analyst or party is purely synthetic. It’ll have the same numerical aspects and statistical patterns, but no source data at all is being handed to a party that’s not authorized to view the source data.
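
One way to sketch that removal step: before handing anything off, filter the generated records against the source so nothing copied verbatim slips through. The function names here are hypothetical, and this only catches exact matches after normalizing case and whitespace. Near-duplicates would need a fuzzier comparison.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


def drop_source_overlap(source, synthetic):
    """Keep only synthetic records that do not reproduce a source record.

    Also drops repeated synthetic records. Exact-match check only.
    """
    source_keys = {normalize(record) for record in source}
    seen = set()
    kept = []
    for record in synthetic:
        key = normalize(record)
        if key in source_keys or key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept
```

Only the returned list goes to the other party; the source list stays with you.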

Now, let’s tackle the thorniest use case: synthetic generation of data to work around data you don’t have permission to work with. To do this and remain compliant with laws and regulations, you cannot use ANY source data at all, and thus this generation technique will generate less accurate data than the other techniques. I will also caution you that I am not a lawyer and cannot give legal advice. Consult with your legal team for legal advice specific to your situation.

Suppose you wanted to generate some customer interactions for training a language model. What you can’t do, if you want to be in strict alignment with regulations like GDPR and CPRA, is use any actual customer data for synthetic generation. What you CAN do is use your own recall of aggregate information about customers to build a series of synthetic customer profiles, and then generate data from those profiles.

Let’s look at an example. Suppose Trust Insights wanted to generate synthetic data about our EU customer base and we hadn’t obtained customer permissions to use their data for this purpose. How would we go about doing this? First, we can develop a general understanding of our customer base. Across our base – perhaps by talking to our sales people or account managers – we could understand the general job titles of people who are customers. We could also get a general understanding of the characteristics of those people – affinities, interests, etc. We could also extract our own data about our customer base as a whole, things like average deal size or average annual revenue from a particular market or set of companies. From there we’d use a large language model to start inferring the characteristics of this customer persona by having it ask us general questions about it.

Once we have sufficiently well developed personae, we can instruct the model to start generating the data we want. Now, to be clear, there is a greater risk of hallucination – aka statistically valid but factually incorrect knowledge – being generated here. We’re working off anecdotes and assumptions that may not be grounded in fact. It’s always better to use actual data rather than to work off assumptions, but if we have absolutely no access to data permitted by law, this would be a workaround until we get real data obtained with consent.
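
As a rough sketch of that last step, a persona can be captured as a small structure built only from aggregate knowledge and then turned into a generation request. Everything here is hypothetical: the `Persona` fields, the function name, and the prompt wording are illustrative assumptions, and any values you fill in should come from recall and aggregates, never from individual customer records.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    """A synthetic customer profile built from aggregate knowledge only."""
    job_title: str
    region: str
    interests: list = field(default_factory=list)
    avg_deal_size: float = 0.0


def persona_prompt(persona, n_interactions=10):
    """Build a request for synthetic customer interactions from a persona."""
    interests = ", ".join(persona.interests) or "not specified"
    return (
        "You are generating synthetic training data. No real customer data "
        "is provided or should be invented as if real.\n\n"
        f"Persona: {persona.job_title} based in {persona.region}.\n"
        f"Interests: {interests}.\n"
        f"Typical deal size: about {persona.avg_deal_size:,.0f}.\n\n"
        f"Write {n_interactions} realistic customer support interactions "
        "this persona might have with an analytics consulting firm."
    )
```

Because the output rests on assumptions, it’s worth having someone who knows your customers spot-check a sample before you use it.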

That last part is the most important part; purely generated data cobbled together from assumptions isn’t a long-term solution. It’s a stopgap measure to let you start building with data until you obtain real data with permission to ground your synthetic data generation in reality.

Synthetic datasets solve for a lot of problems in AI and data science, but sometimes those solutions are stopgaps until you fix the real problem (like user consent), and other times they’re the only solution (like insufficient volume of data). What’s most important is that you’re clear on the problem you’re trying to solve before you use synthetic data.

And shameless plug, if you want help with synthetic data, this is literally what my company does, so if getting started with this use of generative AI is of interest, hit me up.

ICYMI: In Case You Missed it

Besides the newly updated Generative AI for Marketers course I’m relentlessly flogging, we had a killer livestream this past week on using AI for SEO. Tons of useful tips, so go check it out!

Advertisement: Generative AI Workshops & Courses

Imagine a world where your marketing strategies are supercharged by the most cutting-edge technology available – Generative AI. Generative AI has the potential to save you incredible amounts of time and money, and you have the opportunity to be at the forefront. Get up to speed on using generative AI in your business in a thoughtful way with Trust Insights’ new offering, Generative AI for Marketers, which comes in two flavors, workshops and a course.

Workshops: Offer the Generative AI for Marketers half and full day workshops at your company. These hands-on sessions are packed with exercises, resources and practical tips that you can implement immediately.

👉 Click/tap here to book a workshop

Course: We’ve turned our most popular full-day workshop into a self-paced course. The Generative AI for Marketers online course is now available and just updated as of April 12! Use discount code ALMOSTTIMELY for $50 off the course tuition.

👉 Click/tap here to pre-register for the course

If you work at a company or organization that wants to do bulk licensing, let me know!

Get Back to Work

Folks who post jobs in the free Analytics for Marketers Slack community may have those jobs shared here, too. If you’re looking for work, check out these recent open positions, and check out the Slack group for the comprehensive list.

Advertisement: Free Generative AI Cheat Sheets

The RACE Prompt Framework: This is a great starting prompt framework, especially well-suited for folks just trying out language models. PDFs are available in US English, Latin American Spanish, and Brazilian Portuguese.

4 Generative AI Power Questions: Use these four questions (the PARE framework) with any large language model like ChatGPT/Gemini/Claude etc. to dramatically improve the results. PDFs are available in US English, Latin American Spanish, and Brazilian Portuguese.

The Beginner’s Generative AI Starter Kit: This one-page table shows common tasks and associated models for those tasks. PDF available in US English (mainly because it’s a pile of links)

Advertisement: Ukraine 🇺🇦 Humanitarian Fund

The war to free Ukraine continues. If you’d like to support humanitarian efforts in Ukraine, the Ukrainian government has set up a special portal, United24, to help make contributing easy. The effort to free Ukraine from Russia’s illegal invasion needs your ongoing support.

👉 Donate today to the Ukraine Humanitarian Relief Fund »

Events I’ll Be At

Here are the public events where I’m speaking and attending. Say hi if you’re at an event also:

  • Australian Food and Grocery Council, Melbourne, May 2024
  • Society for Marketing Professional Services, Los Angeles, May 2024
  • MAICON, Cleveland, September 2024
  • MarketingProfs B2B Forum, Boston, November 2024

There are also private events that aren’t open to the public.

If you’re an event organizer, let me help your event shine. Visit my speaking page for more details.

Can’t be at an event? Stop by my private Slack group instead, Analytics for Marketers.

Required Disclosures

Events with links have purchased sponsorships in this newsletter and as a result, I receive direct financial compensation for promoting them.

Advertisements in this newsletter have paid to be promoted, and as a result, I receive direct financial compensation for promoting them.

My company, Trust Insights, maintains business partnerships with companies including, but not limited to, IBM, Cisco Systems, Amazon, Talkwalker, MarketingProfs, MarketMuse, Agorapulse, Hubspot, Informa, Demandbase, The Marketing AI Institute, and others. While links shared from partners are not explicit endorsements, nor do they directly financially benefit Trust Insights, a commercial relationship exists for which Trust Insights may receive indirect financial benefit, and thus I may receive indirect financial benefit from them as well.

Thank You

Thanks for subscribing and reading this far. I appreciate it. As always, thank you for your support, your attention, and your kindness.

See you next week,

Christopher S. Penn
