You Ask, I Answer: Capturing Voices with AI?

In today’s episode, I explain how to capture someone’s unique writing style or speaking voice for AI tools. For writing, use neural style transfer on a diverse sample of their work. For voice, have them read varied scripts to capture tone and inflection. Tune in for specific tips on gathering the right training data to clone voices and writing styles with AI.

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

In today’s episode, Briar asks, “How do we program a voice of a CEO or expert we frequently write for, so that we can use their voice?” Well, okay, so there’s a bit of ambiguity here. If you mean the speaking voice, that’s one avenue; if you mean just their writing style, that’s a different avenue.

So let’s tackle the writing style first. If you want to capture someone’s writing style, there’s a technique called neural style transfer. Essentially, using a tool like ChatGPT (the paid version) or Anthropic’s Claude 2, you would take a large sample of someone’s writing, ideally a diverse sample: a blog post, an article, some emails, maybe some social media comments, a body of work. We’re talking probably a couple of pages of text, at least, that really encompasses how a person speaks, their voice, if you will.

That then gets fed into one of these large language models with a neural style transfer prompt. It’s pretty straightforward. It’s something like, “You are a world-class writing expert; you know style transfer, writing styles, author voices,” and so on, all the keywords and phrases that would be associated with writing styles. You would say, “Your first task is to do a detailed analysis of this person’s writing style in bullet point format,” and it will generate a long list of these things. Then you would use that bullet point list, essentially as its own prompt, to apply to the next piece of content you want to generate. You would say something along the lines of, “Using this defined writing style, write an article about X, Y, or Z.” So that’s how you capture someone’s voice in text.
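
As a rough sketch of the two-step prompting described above, the prompts could be assembled in Python like this. The function names build_analysis_prompt and build_generation_prompt are hypothetical, not something from the episode, and the wording only approximates the prompts quoted in the transcript. The sketch builds prompt text only; sending it to ChatGPT or Claude and collecting the bullet list is left to whichever tool you use.

```python
def build_analysis_prompt(samples):
    """Step 1: ask the model to describe the writing style as bullet points.

    `samples` is a list of strings (blog posts, emails, comments, etc.).
    Empty or whitespace-only samples are skipped.
    """
    joined = "\n\n---\n\n".join(s.strip() for s in samples if s.strip())
    return (
        "You are a world-class writing expert in style transfer, "
        "writing styles, and author voices.\n"
        "Your first task is to do a detailed analysis of this person's "
        "writing style in bullet point format.\n\n"
        "Writing samples:\n\n" + joined
    )


def build_generation_prompt(style_bullets, topic):
    """Step 2: reuse the bullet list from step 1 as its own prompt."""
    return (
        "Using this defined writing style, write an article about "
        f"{topic}.\n\nWriting style:\n{style_bullets.strip()}"
    )
```

You would paste the output of build_analysis_prompt into the model, copy the bullet list it returns, and pass that list to build_generation_prompt along with the topic of the next piece.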

If you were talking about the actual cloning of someone’s voice, using a tool like voice gen, Tortoise, or ElevenLabs, you need to start with good-quality sampled audio, ideally something that’s made professionally with, you know, a good microphone. You can use a smartphone as long as the environment in which you’re recording is pristine. The best place to do that, if you don’t have access to an actual sound studio, is if you know somebody who has a nice car, like a Lexus or something, where it’s quiet inside. Make sure everything is off in the car. Obviously don’t do this in the middle of summer; you’ll suffocate. You put the phone four to five inches from the person’s mouth, turn on the voice memos app, and you have them recite some scripts.

And the scripts that you want to have them recite, and this is one of the catches with voice transfer, should not be business content, should not be a blog post. Because when you have somebody reciting a blog post or business content, you get something that sounds like this: “Trust Insights will build a media model mix using stock performance data public relations campaigns and efforts organic search data public relations scenes outcomes.” See what I mean? It’s very flat. There’s not a lot of intonation. There’s not a lot of emphasis or variation.

So what should you use? Ask the person that you’re working with, and this is something that you want to do in detail. Ask them what their favorite TV show is, and then go online and find a script from an episode of that show. You’ll have to do a little bit of reading. Ask them to read out some of their favorite show’s script, because it’s going to sound very different if they’re reading from something that’s a lot more dramatic, right?

You would see something like this. I’ll read a segment here from a piece of fiction: “Let me check the photon count. That doesn’t make any sense. She’s calibrated the photonic gun to aim inward instead of down the test range. I don’t understand, it’s like she’s... holy shit, she’s gonna shoot it at herself.”

You see how much more variance there is in the voice. That is the kind of sample that you want to use for any of these voice training tools, because they are looking for variation. They’re looking for variance; they’re looking to learn as much as possible about your voice. And if you were just reading in a monotone, or capturing the sound of somebody’s voice in a monotone, it’s gonna be much harder for that software to capably generate good, varied audio.

If you have wildly varying audio, the tone and inflection, the things that really capture how a person really speaks, then you’re going to get a much better sample and you’re going to get much better output. With a tool like ElevenLabs, for example, they’ll ask you for 10 different sound samples of varying lengths, you know, 30 seconds, a minute, two minutes. But it’s not how long or how much data you provide; it’s how diverse the data set is. You want that variance.

So that’s my suggestion. Another way to do it would be to have them recite, as prose or as poetry, some of their favorite song lyrics. Not to sing them, because you don’t want the musical intonation, but to read out some of their favorite song lyrics, because you’ll still get some of that rhythm, you’ll still get some of that variation, that variance in their voice that will capture the essence of their voice.

So that’s how you would do that. As for the mechanics, you follow the steps in the software of your choice. But that’s how you do the sound samples so that you get good quality. Now, if the person that you’re working with has a body of public record already, you know, someone who’s an actor, someone who has been on a lot of podcasts, someone who does earnings calls, things like that, you can go through those archives manually and identify segments and snippets. Like, you know, this one time Bob was really yelling at that analyst on that call. Okay, great, let’s take that segment and slice it down to 30 seconds or a minute or whatever the software requires. Then you can put that in the training library, for the way that these tools will memorize information.
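
For the slicing step, here is a minimal sketch using only Python’s standard wave module. It assumes the archive recording is already an uncompressed WAV file; podcast MP3s or video would need converting first, which is not shown. The function name slice_wav and the file names in the note below are hypothetical.

```python
import wave


def slice_wav(src_path, dst_path, start_s, duration_s):
    """Copy up to duration_s seconds of audio, starting at start_s, into dst_path.

    Returns the length of the clip actually written, in seconds. The clip is
    shorter than requested if the source ends first.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Clamp the start and length so we never seek or read past the end.
        start = min(max(int(start_s * rate), 0), total)
        count = max(min(int(duration_s * rate), total - start), 0)
        src.setpos(start)
        frames = src.readframes(count)

    with wave.open(dst_path, "wb") as dst:
        # Reuse the source format; the frame count in the header is
        # corrected by the wave module when the clip is written.
        dst.setparams(params)
        dst.writeframes(frames)

    return count / rate
```

For example, slice_wav("earnings_call.wav", "bob_yelling.wav", 754, 30) would write a 30-second clip starting 12 minutes 34 seconds into the call, assuming that is where the lively segment is.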

But the key is that variance in tonality, and the way they speak in their actual voice. So those are my suggestions if you want to do a really good job with the actual voice, and for capturing someone’s writing style. It’s a really good question. This kind of training data, gathering it and polishing it, is going to be really important in the next couple of years, right? The big challenge of AI is not the AI systems; it’s having the data needed to generate good results.

So the sooner you get good at doing stuff like this, the easier it’s going to be for you. Thanks for the question, and talk to you soon. If you liked this video, go ahead and hit that subscribe button.
