SSML is great for tuning text-to-speech (TTS) systems, but it has some limitations. Understanding these limitations, why they exist and how to deal with them, will help you create more effective voice user interfaces and prevent you from hitting a wall later down the line.
Before we dive into the limitations, it’s helpful to explain what we’re actually talking about with some high-level SSML basics.
What is SSML?
SSML stands for Speech Synthesis Markup Language. It enables you to make tweaks and adjustments to synthetic voices (TTS) to make them sound more natural or to correct common mispronunciations. Think of it like CSS, but for voice applications.
Not only can you make speech synthesis systems pronounce words differently using SSML, you can also add things like breaks and pauses, as well as speed-up, slow down or adjust the pitch, among other things, to change the cadence and delivery of speech to make it sound more natural.
Why do I need SSML?
When you listen to written dialogue spoken back through TTS, it might not quite sound how you imagined. It might not sound ‘human’ or natural enough. SSML is a crucial tool to help you fix that.
For example, it might mispronounce your brand name. It might not say a certain word clearly enough. Perhaps it’s a little too quick, making what’s said hard to digest in the moment. Maybe the sound of the voice increases in pitch towards the end of a sentence, making your phrase sound like a question? You might want to emphasise a particular part of the sentence.
This is where SSML is useful.
Types of SSML markup
Common SSML tags that you can use to manipulate TTS systems are:
- audio: embeds an audio file into the dialogue. Great for adding things like earcons
- break: inserts a pause for a specified number of seconds or milliseconds
- emphasis: speaks tagged words louder and slower
- lang: specifies the intended language the voice should speak
- p: paragraph tag that adds pauses after tagged text to denote the end of a paragraph
- phoneme: allows you to construct specific pronunciation for words by assembling individual phonemes together from the phonetic alphabet
- prosody: lets you adjust the volume, rate (speed) and pitch of text
- s: adds a pause at the end of a sentence. Similar to the p tag, only a shorter pause
- say-as: lets you change how certain words, phrases or numbers are said. For example, if you want the number 1234 to be read “One, two, three, four”, or “One thousand, two hundred and thirty four”, and many other options.
- speak: the ‘root’ element. All spoken text is surrounded in a speak tag.
- sub: substitute one word for another. For example, pronounce the written word “e.g.” as “for example”
- voice: specify a TTS voice (commonly used in Alexa skills to call upon an Amazon Polly voice instead of the built-in Alexa voice)
- w: used to disambiguate the pronunciation of words, such as the present and past tense of “read”, i.e. “I’m going to read something” vs “I read a book yesterday”, among other options
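To make this concrete, here’s a short, hypothetical snippet combining several of the tags above (syntax follows the W3C SSML specification; exact attribute support varies by platform, so always check your provider’s docs):

```xml
<speak>
  Welcome back. <break time="500ms"/>
  Your order number is <say-as interpret-as="digits">1234</say-as>.
  <prosody rate="90%" pitch="-2st">
    Delivery usually takes two to three days,
    <sub alias="for example">e.g.</sub> Thursday or Friday.
  </prosody>
  <emphasis level="strong">Thanks for shopping with us!</emphasis>
</speak>
```

Here, say-as reads the order number digit by digit, prosody slows and deepens the middle sentence slightly, and sub swaps the written “e.g.” for the spoken “for example”.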
Then, some voice assistants have their own specific tags unique to their platform. Broadly speaking, though, most systems allow for the standard tags to be used, but you should always check with your provider, just to be sure.
The limitations of SSML
SSML isn’t perfect. Every conversation designer has a story to tell about SSML and the pain it’s caused them at some point in their career: meticulously tweaking phonemes to pronounce a brand name just right, or working in the appropriate level of pauses to fine-tune the cadence so it sounds more natural. It all takes work.
There are some no-code, drag-and-drop SSML editors out there that can make that fiddly work a little easier for non-technical folks, but still, there are limits to SSML that no tool can fix.
To show you the real limitations of SSML, we need to take a step back to think about the voice you choose in the first place.
Before thinking about SSML, start with selecting the right voice
When creating voice applications, the first thing to think about when designing a voice is the use case the voice will be used for.
Many of the struggles conversation designers have with SSML stem from trying to use SSML to cover up, bend or shape voices to have them do things that the voice just isn’t designed for.
Most TTS systems are generic
Most standard TTS systems are designed to be general purpose. And they need to be because they can be used for a wide variety of use cases way beyond voice AI applications.
For example, you can use Amazon Polly to narrate a sales video, host your IVR, announce your PA system or read articles out loud.
Because they need to be versatile enough to be used everywhere, for a wide range of things, out-of-the-box TTS systems are… Well, bland.
They are purposefully generic.
Manipulating generic TTS with SSML doesn’t always end well
Generic voices make TTS systems hard to adapt for certain use cases. For example, let’s say you’re designing a voice application and you want your voice to sound super-animated and happy. Most conversation designers will write the dialogue, choose a voice, then get to work hacking away at the SSML to make a generic voice sound more enthusiastic. That’s a little bit like trying to turn a guitar into a piano, because generic TTS voices are built for generic purposes.
The reason Amazon has things like a ‘newsreader’ voice for Alexa is that the use case requires a completely different kind of voice. They can’t hack away at the standard Alexa voice with SSML to make it sound more like a newsreader; they have to build the entire TTS voice from the beginning with that use case and purpose in mind.
These days, they might have done that using deep learning models, but they may well have had to record entirely new audio from the person who voices Alexa.
Crap-in = crap-out
When you create a TTS voice from scratch, you base it on recordings of a human speaker.
You record at least 3 hours of audio, transcribe it, label it, break it down into phonemes, synthesise it, build your model, train it and tweak it until you have your voice.
So the voice you end up with is directly correlated to the human recorded samples that you started with. If those human recordings are excitable and happy, you’ll have an excitable and happy TTS. If those recordings are straight down the middle and bland, then the TTS you’ll end up with will be the same. And SSML can’t change that.
Why SSML can’t change the sound of a voice as much as you want it to
When using SSML, what you’re essentially doing is giving the TTS system instructions about how it should manipulate the audio it produces. You’re not ‘tuning a voice’. Not really. You’re manipulating audio.
For example, if you use the <prosody> tag to slow down the rate of speech, all the TTS system is technically doing is time-stretching the audio it produces so that each sample plays for longer.
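As a hypothetical bit of markup, halving the speaking rate looks like this:

```xml
<speak>
  <prosody rate="50%">This sentence plays back at half speed.</prosody>
</speak>
```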
When you time-stretch an audio file, there’s a breaking point. A point at which the audio is stretched so much that it distorts.
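You can demonstrate this breaking point with a toy experiment. The sketch below (pure Python, with an illustrative 8,000-sample-per-second rate; not how any real TTS engine is implemented) naively time-stretches a tone by interpolating its samples, then estimates pitch by counting zero crossings. Slowing playback halves the pitch:

```python
import math

SAMPLE_RATE = 8000  # illustrative low, telephone-quality sample rate

def sine(freq_hz, seconds, rate=SAMPLE_RATE):
    """Generate a pure tone as a list of samples."""
    n = int(seconds * rate)
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]

def naive_time_stretch(samples, factor):
    """Slow audio down by stepping through the samples more slowly
    (linear interpolation between neighbours), with no pitch correction."""
    out, i = [], 0.0
    while i < len(samples) - 1:
        lo = int(i)
        frac = i - lo
        out.append(samples[lo] * (1 - frac) + samples[lo + 1] * frac)
        i += 1.0 / factor
    return out

def estimate_pitch(samples, rate=SAMPLE_RATE):
    """Rough pitch estimate: count positive-going zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings / (len(samples) / rate)

tone = sine(440, 1.0)                    # one second of A440
slowed = naive_time_stretch(tone, 2.0)   # played at half speed

print(round(estimate_pitch(tone)))    # close to 440 Hz
print(round(estimate_pitch(slowed)))  # close to 220 Hz: slower = deeper
```

A real TTS engine stretching low-quality audio runs into the same physics: the samples get spread thinner, the pitch drops, and artefacts creep in.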
That’s because digital audio has a ‘sample rate’ (the number of samples, or snapshots of the sound wave, in one second of audio) and a ‘bit rate’ (the amount of data used to encode each second).
Why sample rates and bit rates matter
A CD (compact disc, remember those?) plays audio at a sample rate of 44,100 samples per second. That’s 44,100 individual snapshots of the sound wave hitting your ears in succession in a single second.
Naturally, the human ear can’t distinguish each individual sample, so it sounds like a steady stream of audio.
It works in the same way as video. A 24-frames-per-second video is just 24 still images shown in a row every second. Your eyes can’t process the images changing that fast, so it looks like continuous motion. The same is true of audio.
A compressed stream, like an MP3 on Spotify, has a much lower bit rate, typically 96–320 kilobits per second versus a CD’s roughly 1,411 kilobits per second. To the trained ear, there’s a noticeable drop in quality compared to a CD, as many an audio engineer will tell you.
What about the audio quality of TTS systems?
TTS systems often serve audio at low sample rates and with heavy compression. Amazon Polly, for example, can output sample rates as low as 8,000 samples per second (telephone quality), up to 24,000. Low-quality settings are helpful because they let the audio be returned to the voice application quickly, and with as little internet bandwidth usage as possible.
One reason for this is to make sure the system can still speak at times when the internet connection dips (imagine Alexa just didn’t respond because your internet connection was too weak; not good).
The other reason is so that it can serve the audio quickly enough to simulate a realistic conversation with the user. Humans tend to respond to one another within 200 milliseconds. If it takes much longer than that for a voice application to respond, the user experience is compromised and the interface feels less conversational. It’s as if the system didn’t hear you, can’t answer you or isn’t clever enough to think as quickly.
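A back-of-envelope calculation shows why low-quality audio matters for that roughly 200 millisecond budget. The numbers below are illustrative assumptions, not measurements of any particular service:

```python
LINK_KBPS = 256  # assume a weak internet connection: 256 kilobits/second

def transfer_ms(seconds_of_audio, audio_kbps, link_kbps=LINK_KBPS):
    """Milliseconds needed just to move the audio payload over the link."""
    return 1000 * seconds_of_audio * audio_kbps / link_kbps

# CD-quality PCM: 44,100 samples/s x 16 bits x 2 channels ~ 1,411 kbps
print(round(transfer_ms(1, 1411)))  # 5512 ms: hopeless for conversation
# Telephone-quality PCM: 8,000 samples/s x 16 bits = 128 kbps
print(round(transfer_ms(1, 128)))   # 500 ms: still sluggish
# Heavily compressed TTS output, e.g. ~48 kbps
print(round(transfer_ms(1, 48)))    # 188 ms: inside the human window
```

Real services stream audio so playback can start before the whole file arrives, but the underlying trade-off between quality and responsiveness is the same.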
Low-quality audio means SSML over-manipulation has flaky results
Because TTS systems typically produce low-quality (low sample rate, heavily compressed) audio files to serve to users in voice user interfaces, applying SSML markup (audio manipulation) can result in even lower quality audio in the end.
That’s why your SSML doesn’t always sound great.
Take the example of using the <prosody> SSML tag to slow down the rate (speed) of a voice. As you slow the audio down, you’re time-stretching a low-quality audio file to breaking point. It sounds distorted because it is distorted. Stretching the audio that far creates little gaps of silence between the samples. A naive time-stretch also shifts the pitch of the audio down a few keys, so it sounds ‘deeper’ (you can’t slow a recording down this way without lowering its pitch). Try asking Google Assistant to spell ‘something’ and you’ll see what I mean. You’ll hear a little gargling sound behind the voice, as if it’s talking with a mouthful of water. That’s the distortion.
And that’s just one part of SSML. That’s one tag. These kinds of limitations exist, to a greater or lesser extent, with many SSML tags, which is why SSML can’t be used for wholesale voice manipulation.
What you should use SSML for
You should use SSML for cosmetic improvements. To pronounce words in certain ways, adjust the cadence of speech and for small things. Tweaks.
If you use it for fine-tuning, you’ll get along with it just fine.
If you find yourself hacking away at SSML, trying to manipulate the voice so much that it’s breaking up and you just can’t get it to sound right, chances are you’re pushing SSML further than it’s designed to go, and you’re working with a voice that’s just not right for your use case.
In that case, you’d be better off choosing another voice or doing the hard work to build your own TTS for that use case.
As the technology advances, things might change. Maybe custom TTS becomes easier to create and is democratised, like what Resemble.ai is trying to do.
Until then though, if you choose your initial voice wisely and stick to using SSML for tweaks and tuning, you’ll be fine.