
The next generation of synthetic human-like voices


Why have film-makers been slow to adopt synthetic voice? They could generate voices more quickly and cheaply. They could pick and choose from a huge variety of different voices and accents.


Let’s consider film production from the perspective of voice: first the dialogue is written, then it’s performed by actors, and then film editors use those performances as the guide for their edit. The actors’ voices guide everybody who creates the final film.

It makes no sense to shoe-horn text-to-speech (TTS) into that workflow – the production would fall apart because the team wouldn’t have that guide. Who would generate it, make creative choices about performance and tweak it to perfection? Films are made fast – nobody has time to endlessly tweak things.

And why bother when you can work with actors who provide nuanced performances? Actors become the characters they portray – that’s the glue that holds films together.

The importance of voice actors

Think of this example from The Empire Strikes Back. Imagine if Darth Vader had said “I am your father”, and Luke had replied “no” as casually as if someone had asked whether he wanted olives on his pizza. The moment would have felt dumb. Luke’s feelings would have seemed unconvincing. The massive twist in the story would have been unimpressive. That’s why he replies “no? NO!” with a little crescendo, a strained voice and tears in his eyes.

If the film editor had to work with an unemotional TTS performance of Luke’s lines, they wouldn’t be able to edit a powerful scene and the film wouldn’t have its climax. The raw materials would have been crappy.

Transition to synthetic voices

Now film-makers are starting to adopt synthetic voice. Why? One company seems to have solved the biggest issue that was holding other providers back. They’re called Respeecher, and their technology has been used in high-end productions like The Mandalorian. VUX World spoke with their CEO Alex Serdiuk.

Let’s consider why they’ve succeeded where many synthetic voice providers have failed to get off the ground.

First we need to look at SSML (Speech Synthesis Markup Language). TTS and SSML go hand in hand, and let’s face it – SSML has issues.

The concept makes sense, at least. You choose a synthesized voice and then use SSML to tweak it. For example, you can correct the pronunciation of your brand name, or you can emphasize certain words so that users are sure what their options are when they speak with a voice assistant.
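To make that concrete, here’s a minimal sketch of the usual TTS-plus-SSML workflow, using Amazon Polly via boto3 as one example of a provider. The brand name, prompt and voice are placeholders, and it assumes AWS credentials are already configured.

```python
import boto3

# SSML wraps plain text in markup: <sub> fixes the pronunciation of a
# (made-up) brand name, <emphasis> stresses the options we want the
# caller to catch.
ssml = """<speak>
  Welcome to <sub alias="zizzy">XYZY</sub> support.
  You can say <emphasis level="strong">billing</emphasis>,
  <emphasis level="strong">orders</emphasis>
  or <emphasis level="strong">returns</emphasis>.
</speak>"""

polly = boto3.client("polly")  # assumes AWS credentials are configured

response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",       # tell the engine to parse the markup
    VoiceId="Joanna",      # one voice picked from the provider's catalogue
    OutputFormat="mp3",
)

# The only way to judge the result is to render it and listen.
with open("prompt.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```

Even in a prompt this small, every tag is an editorial decision made in text rather than in the voice itself, which is where the problems below begin.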

Why are SSML and TTS a problem?

Firstly, nobody thinks that way when they speak. We know what we mean when we talk and we use everything available to express it – the words and the way we say them, facial expressions and body language. We don’t imagine the words and then dress them up with prosody as we say them. We have speech patterns and alter them depending on context – for example, the way we speak to a child or a pet is fundamentally different from the way we would speak to a courtroom. One is tender and personal whereas the other is objective and measured.

Secondly, creating a design concept for SSML is hard. You need to know exactly what you want before you start adding markup to everything. You need to think through the whole performance – word by word – and then consider whether there should be more emphasis here, a lower pitch there, or any of many other possibilities. SSML is an abstract language – how can you tell what dramatic effect pitch, loudness or speaking rate will have unless you’ve studied vocal performance? Very few people using SSML have that knowledge. Crafting a clear-sounding voice assistant is possible, but creating a dramatic performance for a film is damn-near impossible.
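To illustrate how quickly that becomes unwieldy, here’s a sketch of what word-by-word markup of a single dramatic line might look like. Every pitch, rate, volume and pause value below is a guess, which is exactly the point: you’re directing a performance in the abstract.

```python
# One short dramatic line, marked up decision by decision. Nothing here
# tells you how it will actually sound; only synthesizing and listening does.
dramatic_line = """<speak>
  <prosody rate="slow" volume="soft">No.</prosody>
  <break time="400ms"/>
  <prosody rate="fast" pitch="+15%" volume="loud">No!</prosody>
  <break time="300ms"/>
  <prosody pitch="+10%">That is <emphasis level="strong">not</emphasis> true.</prosody>
  <break time="500ms"/>
  <prosody rate="x-slow" pitch="-10%" volume="x-soft">That is impossible.</prosody>
</speak>"""
```

And because each of those values is a guess, the only way forward is the adjust-and-listen loop described next.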

Thirdly, it doesn’t always work. You can spend hours tweaking SSML and only make meagre improvements because the system might not allow for what you want to achieve. The only way to test what you’ve done is to listen. You can get stuck in the loop of adjust – listen – adjust – listen for hours until you’ve lost all perspective. This isn’t efficient.

Respeecher’s differentiation

Respeecher’s speech-to-speech (STS) approach is different. You don’t write text and then apply SSML. You start with a natural vocal performance – for example, bellowing “nooooo!” into your microphone – and your voice could then be converted into Luke Skywalker’s (this is just an example; they don’t allow anyone to use actor Mark Hamill’s voice, for ethical reasons). It would sound like Luke Skywalker is saying your words, the way you said them. All the emotion that was present when you spoke is still apparent even when it’s said with a different voice. The right emotion can be expressed at just the right moment to tell the story.
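Respeecher’s own tools aren’t something you drive from a script like this, so purely as a conceptual contrast with the TTS workflow above, here is a hypothetical sketch of a speech-to-speech step: the input is a recorded performance rather than text, and a voice-conversion model (stubbed out here) would swap the timbre while keeping the timing and emotion. All names and paths are placeholders.

```python
import numpy as np
import soundfile as sf

def convert_voice(performance: np.ndarray, sample_rate: int) -> np.ndarray:
    """Placeholder for a voice-conversion model.

    A real speech-to-speech system would replace the speaker's timbre with
    the target voice while preserving the timing, phrasing and emotion of
    the original take. Here it passes the audio through unchanged so the
    sketch stays runnable.
    """
    return performance

# The input is an acted take, not a text string ("my_take.wav" is a
# placeholder path).
performance, sr = sf.read("my_take.wav")

# The direction happens in the recording booth; the conversion step only
# changes whose voice you hear.
converted = convert_voice(performance, sr)

sf.write("target_voice_take.wav", converted, sr)
```

The point of the sketch is the shape of the workflow: the creative decisions live in the recorded performance, not in markup.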

Sound teams are used to working with voice actors, so Respeecher’s tech fits into a sound team’s workflow with very little friction and plenty of benefits. This is why it works. It seems so easy and obvious, but it’s a massive difference from the normal TTS workflow of writing words and then marking them up with SSML. Respeecher matches the way we use our voices to express ourselves, so we can focus on performance rather than fiddling with an abstract markup language. Revolutionary.

This workflow isn’t specific to Hollywood – if you find yourself getting bogged down with SSML then perhaps it’s worth considering Respeecher?

Here’s a video of their software in action.

Listen to Alex Serdiuk speak all about Respeecher’s speech-to-speech technology on the VUX World podcast. Listen on YouTube, Apple Podcasts, Spotify or wherever you get your podcasts.

A few words about Ukraine

Respeecher is a Ukrainian company, and much of their team still live there, perilously surviving under an unprovoked attack from the Russian army. Respeecher are doing great work within these horrific circumstances, and are even using their technology to boost morale at home.

Have a look.


This article was written by Benjamin McCulloch. Ben is a freelance conversation designer and an expert in audio production. He has a decade of experience crafting natural-sounding dialogue: recording, editing and directing voice talent in the studio. Some of his work includes dialogue editing for Philips’ ‘Breathless Choir’ series of commercials, a Cannes Pharma Grand Prix winner; leading teams in localizing voices for Fortune 100 clients like Microsoft; as well as sound design and music composition for video games and film.
