Speech Recognition and Transcription Services Compared

Warning: this content is older than 365 days. It may be out of date and no longer relevant.

Many marketers need to transcribe audio and video, but prices and quality vary wildly across the market. Let’s compare the options and look at the transcription/speech recognition landscape to see what fits our marketing needs best.

Why Transcription?

We face more and more rich media content as marketers – audio, video, and interactive media. Yet most of our organic search value comes from good old plain text – words on a page. To make the most of the rich media we have, we need to convert the spoken words in our rich media into plain text for use in blog posts, eBooks, email, and other searchable content.

Transcription is the best way to accomplish this goal. Transcription helps us take advantage of existing content, rather than re-invent the wheel every time we need text-based content. The average person speaks at approximately 150 words per minute; the average blog post has approximately 300 words. Just two minutes of high-quality speaking could yield a blog post that might take a mediocre author an hour to draft. If we leverage the great audio and video content we’ve already created, we can make our content work harder for us in multiple formats.
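The arithmetic behind that claim is easy to check. Here is a minimal sketch that uses only the two rough figures quoted above (150 words per minute, 300 words per post); the function names are made up for illustration, and real speakers and real posts will vary.

```python
SPEAKING_RATE_WPM = 150  # approximate speaking rate quoted above
BLOG_POST_WORDS = 300    # approximate blog post length quoted above


def minutes_of_speech(target_words: int, wpm: int = SPEAKING_RATE_WPM) -> float:
    """Minutes of speech needed to produce target_words of raw transcript."""
    return target_words / wpm


def words_from_recording(minutes: float, wpm: int = SPEAKING_RATE_WPM) -> int:
    """Approximate raw word count in a recording of the given length."""
    return round(minutes * wpm)


print(minutes_of_speech(BLOG_POST_WORDS))  # 2.0
print(words_from_recording(30))            # 4500
```

By the same estimate, a 30-minute recording holds roughly 4,500 raw words, or about fifteen posts' worth of text before editing.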

The Transcription Landscape

Now that we understand why transcription matters, let’s look at the landscape of services available.

Human-powered transcription costs anywhere from $1 to $3 per minute of recorded audio, and the results are usually top-notch because human beings are excellent at recognizing speech. Even outsourced, overseas transcription services generally yield good quality, especially for non-technical speech.

Professional automated services – usually with human supervision or quality checking – cost anywhere from $0.25 to $1 per minute of recorded audio, and the quality is decent. A machine takes the first pass at the audio, then a human cleans up anomalies in the transcription.

Finally, fully automated, AI-based transcription services such as IBM Watson and Google Cloud offer somewhat accurate transcription for 1 to 2 cents per minute of recorded audio. While the accuracy isn’t as good as human-assisted or human-powered transcription, the cost savings are considerable.
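To see how quickly those per-minute prices add up, here is a rough sketch that estimates the cost of a recording in each tier. It assumes the figures above are US dollars per recorded minute; the tier labels and function names are invented for this example, and actual vendor pricing will differ.

```python
from typing import Dict, Tuple

# Price ranges per recorded minute, taken from the tiers described above.
# Assumption: amounts are US dollars.
PRICE_PER_MINUTE: Dict[str, Tuple[float, float]] = {
    "human-powered": (1.00, 3.00),
    "professional automated": (0.25, 1.00),
    "fully automated (AI)": (0.01, 0.02),
}


def cost_range(minutes: float, tier: str) -> Tuple[float, float]:
    """Return the (low, high) estimated cost for a recording in one tier."""
    low, high = PRICE_PER_MINUTE[tier]
    return (round(minutes * low, 2), round(minutes * high, 2))


def compare_tiers(minutes: float) -> Dict[str, Tuple[float, float]]:
    """Return the estimated cost range for every tier."""
    return {tier: cost_range(minutes, tier) for tier in PRICE_PER_MINUTE}


# Example: a one-hour webinar recording.
for tier, (low, high) in compare_tiers(60).items():
    print(f"{tier}: ${low:.2f} to ${high:.2f}")
```

For a one-hour recording, that works out to roughly $60 to $180 for human transcription, $15 to $60 for a professional automated service, and about $0.60 to $1.20 for a fully automated one.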

Let’s look at what a one-minute file of top-notch audio quality turns into with a few different services. Here’s the raw audio file if you’d like to compare:

https://soundcloud.com/cspenn/voicerecognitiontest

Professional Automated Service

This is an example of a typical type of voice file that you would want to transcribe. As one speaker it has words that someone would use while they were musing about a particular topic while they were thinking out loud. There’s some background noise from the environmental variables. There are the usual starts and stops and. Other background noises or file noise as you would expect from human communication that are part of conversation and part of the way people talk. And in particular if you are transcribing from a recording of some kind from a meeting or a phone call or a conference speech you’re going to get these kinds of background noises. So with this we’re going to test this out to see just how it sounds rather just how well each transcription service does with sort of a bake off if you will between one or more cognitive services. This file has been leveled using Adobe Audition So Reed is as good as it gets when it comes to audio quality and the microphone was a little boom microphone. That’s a boom condenser. So it’s very close to the speaking source. This is as pristine as you’re going to get when it comes to audio recordings especially when you’re dealing with stuff like conference calls where the microphone quality is is low to say the least. So let’s see what comes out when we run this through a couple of different voice recognition services.

Google Cloud

this is an example of a typical type of voice file that you would want to transcribe as one speaker it has words that someone would use while they were amusing about a particular topic while they were thinking out loud there’s some background noise from environmental variables there are the usual starts and stops and other background noises or or file noise as you’d expect from Human communication that’s a part of conversation and part of the way people talk and in particular if you are transcribing from a recording of some kind from a meeting or a phone call or a conference speech are you going to get these kinds of background noises so with this were going to test this out to see just how it sounds rather just how well each transcription service does

with the server Bake-Off if you will between one or more cognitive services

this file has been leveled using Adobe Audition so weed is as good as it gets when it comes to audio quality and the microphone was a little boom microphone that says boom condenser so it’s very close to the speaking Source this is as pristine as you’re going to get when it comes to audio recordings especially when you dealing with stuff like conference calls where the microphone quality is is low to say the least so let’s see what comes out when we run this through a couple different voice recognition services

IBM Watson

this is an example of a typical type of ,

voice file that you would want to transcribe ,

it has one speaker it has no words that someone would use while they were soon musing about a particular topic while they were thinking out loud ,

there’s some background noise from le heat environmental variables ,

there are the M. do the usual ,

starts and stops and ,

other %HESITATION background noises or or file notices you’d expect from human communication that are part of conversation and part of the way people talk and in particular if you are transcribing from a recording of some kind from a meeting or a phone call or a conference speech are you gonna get these kinds of background noise ,

so with this we gonna test this out to see just how it sounds %HESITATION I rather just how well each transcription service does with the server bake off if you will between ,

%HESITATION ,

one or more cognitive services ,

this file has been ,

leveled ,

using adobe audition so read is as good as it gets when it comes to audio quality and the microphone was a %HESITATION little boom microphone that say a boom condenser so it’s very close to the speaking source of this is ,

as pristine as you’re gonna get when it comes to audio recordings especially when you’re dealing with stuff like conference calls where %HESITATION the microphone quality is is low to say the least so let’s see what comes out we run this through a couple different voice recognition services.
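The Watson output above needs a little cleanup before it can go into a blog post: the %HESITATION markers have to go, and the short fragments ending in a stray comma need to be joined back together. Here is a minimal sketch of that cleanup, written against the pasted text above rather than any official Watson output format; the function name is hypothetical, and the rules are assumptions drawn only from this one sample.

```python
import re

HESITATION = "%HESITATION"  # marker that appears in the Watson sample above


def clean_watson_transcript(raw: str) -> str:
    """Join fragment lines and strip hesitation markers from pasted output."""
    # Drop hesitation markers, leaving a space so neighboring words stay apart.
    text = raw.replace(HESITATION, " ")
    fragments = []
    for line in text.splitlines():
        # Each fragment in the sample ends with a dangling " ,"; remove it.
        line = re.sub(r"\s*,\s*$", "", line)
        # Collapse runs of whitespace left behind by the removals.
        line = re.sub(r"\s+", " ", line).strip()
        if line:  # skip blank lines and fragments that were only a marker
            fragments.append(line)
    return " ".join(fragments)


sample = (
    "starts and stops and ,\n"
    "\n"
    "other %HESITATION background noises ,\n"
    "\n"
    "%HESITATION ,\n"
    "\n"
    "one or more cognitive services ,"
)
print(clean_watson_transcript(sample))
# starts and stops and other background noises one or more cognitive services
```

This only tidies the formatting. It does not fix misheard words or add sentence punctuation, so a human read-through is still needed before publishing.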

Verdict: Use IBM Watson for Transcription

When it comes to the balance of quality and cost, IBM Watson’s speech recognition is the winner. The transcript is fairly accurate, the cost is 2 cents per recorded minute, and it’s in a usable format. Google Cloud is cheaper, but it returns a literal pile of words – no punctuation or language recognition of any kind. The commercial service returns reasonably clean text with punctuation, but the accuracy isn’t much better than Watson’s – and certainly not 12.5 times better, which is how much more it costs per minute ($0.25 versus $0.02).

For what the average marketer needs, IBM Watson is the way to go right now when it comes to transcription for content marketing purposes. Give it a go and see how it does with your content.


Want to read more like this from Christopher Penn? Get updates here:

subscribe to my newsletter here

