An introductory guide to SSML (Speech Synthesis Markup Language), what it is and how you can use it in speech and voice applications.
What is SSML?
SSML stands for Speech Synthesis Markup Language. It enables you to make tweaks and adjustments to synthetic voices (known as text-to-speech, or TTS, voices) to make them sound more natural or to correct common mispronunciations. Think of it like CSS, but for voice applications and speech systems.
Not only can you make speech synthesis systems pronounce words differently using SSML, you can also add things like breaks and pauses, as well as speed-up, slow down or adjust the pitch of a voice, among other things, to change the cadence and delivery of speech to make it sound more natural.
Why do I need SSML?
When you listen to written dialogue spoken back through a TTS system, it doesn’t always sound how you imagined, or how you’d like. It might not sound quite ‘human’ or natural enough. It can sometimes sound quite jarring. SSML is a crucial tool to help you fix that.
For example, a TTS system might mispronounce your brand name. It might not say a certain word clearly enough. Perhaps it’s a little too quick, making what’s said hard to digest in the moment. Maybe the sound of the voice increases in pitch towards the end of a sentence, making your phrase sound like a question? You might want to emphasise a particular part of the sentence.
This is where SSML is useful.
How do I use SSML?
To use SSML within your dialogue system, you simply mark up the dialogue, just as you would mark up code or copy in any other web application.
With web design, for example, to create a paragraph of text, you’d mark up your code with a ‘p’ tag that looks like the following:
<p>This is a paragraph of text.</p>
With dialogue systems, the same principles apply.
All spoken dialogue to be read back by a text-to-speech system must be wrapped in a <speak> tag, as follows:
<speak>Hi, my name is VUX.</speak>
The <speak> tag tells a TTS system that the words contained within it are intended to be spoken.
Once you have your dialogue wrapped in a <speak> tag, there are a bunch of other tags you can use within it to create the kind of effect you’re aiming for.
Examples of SSML tags
Common SSML tags that you can use to manipulate TTS systems are:
- audio: embeds an audio file into the dialogue. Great for adding things like earcons
- break: inserts a pause for a specified number of seconds or milliseconds
- emphasis: speaks tagged words louder and slower
- lang: specifies the intended language the voice should speak
- p: paragraph tag that adds pauses after tagged text to denote the end of a paragraph
- phoneme: allows you to construct specific pronunciation for words by assembling individual phonemes together from the phonetic alphabet
- prosody: lets you adjust the volume, rate (speed) and pitch of text
- s: adds a pause at the end of a sentence. Similar to the p tag, only a shorter pause
- say-as: lets you change how certain words, phrases or numbers are said. For example, if you want the number 1234 to be read “one, two, three, four” rather than “one thousand, two hundred and thirty-four”, among many other options
- speak: the ‘root’ element. All spoken text is surrounded in a speak tag.
- sub: substitute one word for another. For example, pronounce the written word “e.g.” as “for example”
- voice: specify a TTS voice (commonly used in Alexa skills to call upon an Amazon Polly voice instead of the built-in Alexa voice)
- w: used to change the pronunciation of words, such as distinguishing the present tense “read” from the past tense “read”, i.e. “I’m going to read something” vs “I read a book yesterday”, among other options
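To illustrate a few of these tags working together, here’s a short example (the dialogue is hypothetical, and the exact attribute values supported, such as the interpret-as options, vary by platform, so check your provider’s documentation):

<speak>Your order number is <say-as interpret-as="digits">1234</say-as>.
<break time="500ms"/> You can collect it from any of our stores,
<sub alias="for example">e.g.</sub> our flagship shop.</speak>

Here, the number is spelled out digit by digit, a half-second pause is added, and “e.g.” is spoken as “for example”.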
Then, some voice assistants and speech synthesis systems have their own specific tags unique to their platform. Broadly speaking, though, most systems allow for the standard tags to be used, but you should always check with your provider, just to be sure.
How to use SSML tags in speech synthesis systems
Inserting specific SSML tags into your dialogue is simple.
Let’s say you’d like your synthetic voice to pause for a moment at the end of a sentence. Just entering a period ‘.’ will accomplish this on its own, without the need for any specific markup. A ‘.’ will insert a pause of between 500ms and 1s, depending on the system.
But, let’s say you wanted to make a tweak to increase the length of the pause to create dramatic effect. Here, you’d insert a <break> tag at the point in the dialogue where you’d like the system to pause:
<speak>Hi, my name is VUX, and here's today's news. <break time="2s"/>
VUX World publishes a new guide to SSML...</speak>
Or perhaps you’d like to slow the speed of the dialogue down. You can try:
<speak><prosody rate="x-slow">Hi, my name is VUX.</prosody></speak>
Nesting SSML tags
Just like with HTML, you can nest SSML tags within each other to stitch a number of dialogue manipulations together.
This is a little like inserting an <a> tag within a <p> tag in web development:
<p>VUX World rocks. <a href="https://vux.world">Check it out here.</a></p>
Let’s say you wanted to raise the pitch of a single word, as well as have that word pronounced in a French accent, you could use something like this:
<speak><prosody pitch="high"><lang xml:lang="fr-FR">Bonjour!</lang></prosody></speak>
These kinds of manipulations work well with standard TTS voices, but not so well with neural voices.
Testing and previewing SSML edits
It’s all well and good creating these lines of code, but the most important thing is hearing what it sounds like when it’s read through a TTS system. That’s the only way you’ll be able to tell whether the iterations and changes you’re making are having the desired audible effect.
There are a number of tools that let you do this, including:
Where can I find a full SSML reference guide?
For a full SSML reference guide with code examples for all tag types, try:
Note that these are written with Amazon, Google and Microsoft’s speech systems and voice assistants in mind. That means you might find that some tags don’t work in your TTS system of choice. However, they’re perfect for grabbing the code for the main tags and seeing worked examples.
The limitations of SSML
As fantastic as SSML is, it isn’t perfect. Every conversation designer has a story to tell about SSML and the pain it’s caused them at some point in their career: meticulously tweaking phonemes to pronounce a brand name just right, or working in the appropriate level of pauses to fine-tune the cadence and make it sound more natural. It all takes work.
Plus, at some point, if you need to make really big changes, you’ll start to affect the quality of the actual voice and it’ll have a detrimental impact on your application.
To show you the real limitations of SSML, we need to take a step back to think about the voice you choose in the first place.
Before thinking about SSML, start with selecting the right voice
When creating voice applications, the first thing to think about when designing a voice is the use case the voice will be used for.
Many of the struggles conversation designers have with SSML stem from trying to use SSML to cover up, bend or shape voices into doing things the voice just isn’t designed for.
Most TTS systems are generic
Most standard TTS systems are designed to be general purpose. And they need to be because they can be used for a wide variety of use cases way beyond voice AI applications.
For example, you can use Amazon Polly to narrate a sales video, power your IVR, make announcements over your PA system or read articles out loud.
Because they need to be versatile enough to be used everywhere, for a wide range of things, out-of-the-box TTS systems are… Well, bland.
They are purposefully generic.
Manipulating generic TTS with SSML doesn’t always end well
Generic voices make TTS systems hard to adapt to certain use cases. For example, let’s say you’re designing a voice application and you want your voice to sound super-animated and happy. Most conversation designers will write the dialogue, choose a voice, then get to work hacking away at the SSML to make a generic voice sound more enthusiastic. Because generic TTS voices are built for generic purposes, that’s a little like trying to turn a guitar into a piano.
The reason Amazon have things like a ‘newsreader’ voice for Alexa is because that use case requires a completely different kind of voice. They can’t hack away at the standard Alexa voice with SSML to make it sound more like a newsreader; they have to build the entire TTS voice from the beginning with that use case and purpose in mind.
These days, they might have done that using deep learning models, but they may well have had to record entirely new audio from the person who voices Alexa.
Crap-in = crap-out
When you create a TTS voice from scratch, you base it on recordings of a human speaker.
You record at least 3 hours of audio, transcribe it, label it, break it down into phonemes, synthesise it, build your model, train it and tweak it until you have your voice.
So the voice you end up with is directly correlated to the human recorded samples that you started with. If those human recordings are excitable and happy, you’ll have an excitable and happy TTS. If those recordings are straight down the middle and bland, then the TTS you’ll end up with will be the same. And SSML can’t change that.
Why SSML can’t change the sound of a voice as much as you want it to
When using SSML, what you’re essentially doing is giving the TTS system instructions about how it should manipulate the audio it produces. You’re not ‘tuning a voice’. Not really. You’re manipulating audio.
For example, if you use the <prosody> tag to slow down the rate of speech, all the TTS system is technically doing to slow the voice down is time-stretching the audio it produces to make each sound last longer.
When you time-stretch an audio file, there’s a breaking point. A point at which the audio is stretched so much that it distorts.
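To make this concrete, here’s a minimal, illustrative sketch of the naive form of time-stretching described above: playing the same samples back more slowly by repeating them. It’s a toy model, not how any real TTS engine actually works, but it shows why slowing audio down this way also lowers its pitch. The sample rate and frequencies are illustrative assumptions.

```python
import math

SAMPLE_RATE = 8_000  # Hz; a typical low, TTS-style sample rate (assumption)

def sine_wave(freq_hz, seconds=1.0):
    """Generate one second of a pure tone at freq_hz."""
    n = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

def naive_stretch(samples, factor=2):
    """Crudely time-stretch by repeating each sample `factor` times."""
    return [s for s in samples for _ in range(factor)]

def dominant_freq(samples):
    """Estimate the tone's frequency by counting zero crossings."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if a < 0 <= b or b < 0 <= a
    )
    seconds = len(samples) / SAMPLE_RATE
    return crossings / 2 / seconds

tone = sine_wave(440)            # an A at 440 Hz
slowed = naive_stretch(tone, 2)  # twice as long...
print(dominant_freq(tone))       # ≈ 440 Hz
print(dominant_freq(slowed))     # ≈ 220 Hz: half the speed, half the pitch
```

Doubling the duration this way halves the perceived pitch, and at low sample rates the repeated samples are exactly the kind of stair-stepping that the ear hears as distortion.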
That’s because digital audio has a ‘sample rate’: the number of individual snapshots, or ‘samples’, of sound in one second of audio.
Why audio sample rates matter
A CD (compact disc, remember those?) stores audio at 44,100 samples per second. That’s 44,100 individual snapshots of the sound wave hitting your ears in succession in a single second.
Naturally, the human ear can’t distinguish each individual sample, so it sounds like a steady, continuous stream of audio.
It works in the same way as a video. A 24 frame per second video is just 24 still images in a row. Your eyes can’t process the images changing fast enough, so it looks like a video. The same is true of audio.
An MP3 you’d stream online is compressed, typically to somewhere between 128 and 320 kilobits per second, compared with roughly 1,411 kilobits per second for uncompressed CD audio. To the trained ear, there’s a noticeable drop in quality, and many an audio engineer will tell you that an MP3 sounds terrible.
What’s the sample rate of TTS systems?
TTS systems often serve audio at sample rates as low as 8,000 samples per second (telephone quality), and commonly no higher than 16,000 or 24,000. Low sample rates are helpful because they let the audio be returned to the voice application quickly, using as little internet bandwidth as possible.
One reason for this is to make sure the system can still speak at times when the internet connection dips (imagine Alexa just didn’t respond because your internet connection was too weak; not good).
The other reason is so that it can serve the audio quickly enough to simulate a realistic conversation with the user. Humans tend to respond to one another within 200 milliseconds. If it takes much longer than that for a voice application to respond, the user experience is compromised and the interface feels less conversational. It’s as if the system didn’t hear you, can’t answer you or isn’t clever enough to think as quickly.
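As a rough, back-of-the-envelope sketch of why lower sample rates help keep responses fast: the figures below (uncompressed 16-bit mono audio, a 500 kbps link) are illustrative assumptions, not measurements of any real system.

```python
# How long does one second of uncompressed 16-bit mono audio take to
# transmit at different sample rates? Illustrative figures only.

def transmit_ms(sample_rate_hz, link_kbps=500, bits_per_sample=16):
    """Milliseconds needed to send 1 second of audio over the link."""
    payload_bits = sample_rate_hz * bits_per_sample  # 1 second of audio
    return payload_bits / (link_kbps * 1000) * 1000  # convert to ms

for rate in (8_000, 16_000, 44_100):
    print(f"{rate:>6} Hz: {transmit_ms(rate):7.1f} ms over a 500 kbps link")
```

On these assumptions, telephone-quality audio (8 kHz) ships in around a quarter of a second, while CD-quality audio would take well over a second: longer than the 200 millisecond conversational window before it has even started playing.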
Low sample rates mean SSML over-manipulation has flaky results
Because TTS systems typically produce low sample rate (low ‘quality’) audio to serve to users in voice user interfaces, applying SSML markup (audio manipulation) can result in even lower quality audio by the end.
That’s why your SSML doesn’t always sound great.
In the example of using the <prosody> SSML tag to slow down the rate (speed) of a voice: as you slow the voice down, you’re time-stretching a low sample rate audio file to breaking point. It sounds distorted because it is distorted. Stretching the audio that far creates little gaps between the samples, and the audio is pitch-shifted down as a side effect, so it sounds ‘deeper’ (you can’t simply slow down a piece of audio without lowering its pitch). Try asking Google Assistant to spell ‘something’ and you’ll see what I mean. You’ll hear a little gargling sound behind the voice, as if it’s talking with a mouthful of water. That’s the distortion.
And that’s just one part of SSML. That’s one tag. These kind of limitations exist to greater or lesser extent with many SSML tags, which is why SSML can’t be used for large voice manipulations.
What you should use SSML for
You should use SSML for cosmetic improvements. To pronounce words in certain ways, adjust the cadence of speech and for small things. Tweaks.
If you use it for fine-tuning, you’ll get along with it just fine.
If you find yourself hacking away at SSML, trying to manipulate the voice so much that it’s breaking up and you just can’t get it to sound right, chances are you’re pushing SSML further than it’s designed to go, and you’re working with a voice that’s just not right for your use case.
In that case, you’d be better off choosing another voice or doing the hard work to build your own TTS for that use case.
As the technology advances, things might change. Maybe custom TTS becomes easier to create and is democratised, like what Resemble.ai is trying to do.
Until then though, if you choose your initial voice wisely and stick to using SSML for tweaks and tuning, you’ll be fine.