When you release a digital assistant into the world, you hope it will never have to say the words “can you repeat that?”
Those four words signify a failure in the human-bot conversation. Perhaps the bot hadn't been trained on the user's accent. Perhaps the user's request could be interpreted in many different ways, and the bot wasn't sure how to progress the conversation. Perhaps the bot simply couldn't handle the noisy environment where it's been deployed, such as a restaurant.
Regardless of the reason, saying "can you repeat that?" costs the user's confidence and means the bot has failed to meet their needs, which defeats the purpose of deploying it in the first place.
So, how can you avoid those disastrous four words? That's exactly what Scott Stephenson, CEO of Deepgram, and Dion Millson, CEO of Elerian AI, discussed with VUX World's Kane Simms, sharing 7 steps to digital assistant success.
1. The importance of transcription
When you speak to a bot, Speech to Text (STT) is the technology that transcribes what you said. You'll hear industry insiders talk about Speech to Text or Automatic Speech Recognition (ASR), but they're the same thing: technology that transcribes the user's spoken words.
A bad transcription simply means that the user has been misunderstood. For example, they said "add coffee" but the transcription was "had toffee". From then onwards, the bot might try to take the conversation in a completely irrelevant direction. Worse still, it might say, "I'm sorry, can you repeat that?"
So, the effectiveness of Speech to Text makes a huge difference in conversations with voicebots. Accurate transcription is the first element required to not only keep the conversation on track, but to begin the conversation in the first place.
Is it possible to get a perfect transcription?
Even the best human transcribers make mistakes, so around 99% accuracy is the realistic ceiling. That's good enough, and advancements in deep learning have made it achievable.
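How do you measure that accuracy? The standard metric is word error rate (WER): the number of word-level substitutions, insertions and deletions between a human reference transcript and the machine's output, divided by the length of the reference. Here's a minimal, vendor-neutral sketch:

```python
# Minimal word error rate (WER) sketch: word-level edit distance between a
# human reference transcript and the ASR system's output.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# "add coffee" misheard as "had toffee" is two wrong words out of two:
# 100% WER on that utterance, even though it sounded close.
print(word_error_rate("add coffee", "had toffee"))  # 1.0
```

Accuracy in the "99%" sense is roughly 1 minus WER, so a 99%-accurate system gets about one word in a hundred wrong.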
2. The challenge of voice
When you send a text message to a friend, you don’t expect them to respond within milliseconds. Voice is different. We expect a vocal response to come much quicker than text. In America, the average pause length between turns is 0.74 seconds. Much longer and we assume something’s not right.
Users may think they’ve been misunderstood by the bot, or they may think the bot’s ignoring them. Regardless, it makes for a poor customer experience.
When you consider a human-bot conversation, a lot of processing has to happen on the bot’s side, including:
- Collecting the user’s audio input (their words as well as other possible signs, such as emotion)
- Transcribing the words said
- Working out what those words mean within the context of that conversation
- Deciding what the user wants to do
- Giving the user what they need while generating a Text to Speech (TTS) response
And all of that in a little more time than it takes to blink.
That’s what we’re aiming for when we build ‘natural conversations’ digitally, and it’s no easy task.
Response time is a big challenge. The bot must react in a manner that feels real-time, so you need a tech stack that can process a lot of data quickly, with every component working together to turn around conversational turns at human-like speed.
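To make that budget concrete, here's a back-of-the-envelope latency breakdown. Only the roughly 0.74-second target comes from the figure above; the per-stage timings are illustrative assumptions, not measurements of any particular stack.

```python
# Illustrative latency budget for one conversational turn. The stage timings
# are placeholder assumptions; only the ~0.74 s target comes from the article.
TARGET_GAP_S = 0.74  # average pause between turns in American English conversation

stage_budget_s = {
    "endpointing (detect the user has stopped talking)": 0.20,
    "speech to text (final streaming hypothesis)":       0.15,
    "NLU (intent and entities)":                         0.05,
    "business logic / backend lookup":                   0.15,
    "text to speech (time to first audio)":              0.15,
}

total = sum(stage_budget_s.values())
for stage, seconds in stage_budget_s.items():
    print(f"{stage:<52} {seconds:.2f} s")
print(f"{'total':<52} {total:.2f} s  (target <= {TARGET_GAP_S} s)")
assert total <= TARGET_GAP_S, "over budget: the pause will feel unnatural"
```

If any single stage blows its share of the budget, the whole turn feels slow, which is why streaming STT and fast NLU matter so much.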
3. Train your models on audio from your target use case
Let’s break down the sentence ‘train your models on audio from your target use case’. The data you use to train your ASR model should be specific to your use case. If you sell insurance, your users will use specific words and phrases when they call you. Those utterances may be different from the phrases you use internally within the company.
Who’s going to be talking to this bot?
If it’s for the company’s internal use, then go ahead and train it with the jargon you use in your company video calls. If it’s for a section of the general public, then you must use audio which represents how they talk about those things.
The audio you use must train your system on the:
- Words and phrases used by your customers
- Various accents that your customers may have
- Physical environment your customers will speak to you in
For this, you need a speech recognition system that allows you to retrain your models for specific use cases.
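A practical way to sanity-check those three points is to audit your training audio for coverage before you train. The sketch below assumes one simple record per clip with hypothetical accent and environment tags; the field names are illustrative, not a standard schema.

```python
# Sketch of a coverage audit over a hypothetical training-data manifest.
# Each record describes one labelled audio clip; the field names are made up.
from collections import Counter

manifest = [
    {"text": "I'd like a quote for home insurance", "accent": "scottish", "environment": "quiet"},
    {"text": "what's my excess on the car policy", "accent": "southern_english", "environment": "street"},
    {"text": "I need to make a claim", "accent": "welsh", "environment": "call_centre"},
    # ... in practice, thousands of clips
]

required_accents = {"scottish", "welsh", "southern_english", "northern_english"}
required_environments = {"quiet", "street", "call_centre"}
domain_terms = {"policy", "excess", "claim", "premium"}

accents = Counter(clip["accent"] for clip in manifest)
environments = Counter(clip["environment"] for clip in manifest)
vocab = {word for clip in manifest for word in clip["text"].lower().split()}

print("missing accents:      ", required_accents - set(accents))
print("missing environments: ", required_environments - set(environments))
print("missing domain terms: ", domain_terms - vocab)
```

Anything flagged as missing is a gap your speech recognition system will eventually hit in production, so it's cheaper to collect that audio now than to debug misrecognitions later.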
4. Start with the best STT you can get and then improve it
According to Scott Stephenson, CEO of Deepgram, the new wave of speech recognition start-ups, such as Deepgram, achieve recognition accuracy of between 85% and 90% from the start, whereas legacy providers start at 65% to 75%. That's a great head start, yet it can be improved further.
Once you’ve selected the best STT provider for your use case, you’ll need to adapt the model to your domain. Perhaps it’s missing vocabulary, or it doesn’t understand a specific accent from your locale, or your bot will be deployed in a noisy environment.
By focusing on training data for these unique needs, you will improve your bot and achieve ever-greater accuracy with speech recognition.
How to train speech recognition models for specific use cases
Scott says to start with 10 to 100 hours of labelled data that's representative of what you're trying to cover. Get a machine transcription first, then have humans go through and edit the machine's output. With that approach, you can realistically achieve around 99% accuracy.
In-house ASR training
If it makes sense to put your own team together for this task, the benefit is that you can ensure the specific language within your domain gets covered. Outsourcing this work runs the risk that it's performed by people who aren't sensitive to your specific linguistic needs. In other words, you'll be able to label your data better than anybody else.
Outsourcing ASR training
If outsourcing works better for you, it’s vital to establish a ‘style guide’ before work begins. Here, you’ll describe the various things users might say and what they mean. That will help the outsourced data labelling team to keep their work consistent and accurate.
Labelling data is a labour-intensive effort. One hour of audio takes five or more hours to label, and more than ten if the results need to be extremely accurate.
How accurate does it need to be? That depends on your use case and the business case behind it.
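Those two figures, 10 to 100 hours of representative audio and five to ten-plus labelling hours per hour of audio, are enough for a rough budgeting calculation. The multipliers below simply restate those ranges; treat them as planning assumptions rather than quotes.

```python
# Back-of-the-envelope labelling budget using the figures above: roughly 5x
# human hours per audio hour, or 10x+ when the labels need to be very accurate.
def labelling_hours(audio_hours: float, high_accuracy: bool = False) -> float:
    multiplier = 10 if high_accuracy else 5
    return audio_hours * multiplier

for audio_hours in (10, 50, 100):
    print(f"{audio_hours:>3} h of audio: "
          f"~{labelling_hours(audio_hours):.0f} h standard, "
          f"~{labelling_hours(audio_hours, high_accuracy=True):.0f}+ h high accuracy")
```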
5. The importance of semantic understanding
STT may transcribe the user's words perfectly, but what do those words mean? This is where Natural Language Understanding (NLU) comes in: it deciphers the meaning behind the words the user said.
Here are three things to remember about semantic understanding:
- Every user can ask for something in their own unique way, but your NLU must be able to understand all of them
- The NLU must be able to disambiguate between similar wordings with different meanings, such as "crash" meaning a vehicle accident or a frozen computer (see the sketch after this list)
- You need a strategy for continuous improvement. You will receive feedback that shows where you’re making consistent errors – how will you incorporate that feedback to improve the semantic understanding?
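To make that disambiguation point concrete, here's a toy sketch. Real NLU engines learn this statistically from training data; the keyword lists below are purely illustrative stand-ins showing the shape of the problem, not how any particular product works.

```python
# Toy disambiguation sketch: "crash" alone is ambiguous, so we lean on the
# surrounding context words. Real NLU models learn this from labelled data;
# the keyword lists here are purely illustrative.
VEHICLE_CONTEXT = {"car", "van", "motorway", "road", "driver", "accident"}
COMPUTER_CONTEXT = {"laptop", "app", "screen", "frozen", "restart", "software"}

def disambiguate_crash(utterance: str) -> str:
    words = set(utterance.lower().replace(",", "").split())
    if "crash" not in words and "crashed" not in words:
        return "no_crash_mentioned"
    vehicle_score = len(words & VEHICLE_CONTEXT)
    computer_score = len(words & COMPUTER_CONTEXT)
    if vehicle_score > computer_score:
        return "report_vehicle_accident"
    if computer_score > vehicle_score:
        return "report_software_crash"
    return "ask_clarifying_question"   # genuinely ambiguous: ask, don't guess

print(disambiguate_crash("my car crashed on the motorway"))           # report_vehicle_accident
print(disambiguate_crash("the app crashed and the screen is frozen")) # report_software_crash
print(disambiguate_crash("I had a crash"))                            # ask_clarifying_question
```

Note the third branch: when the context genuinely doesn't settle it, a targeted clarifying question is far better than guessing, and far better than a blanket "can you repeat that?"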
Dion Millson, CEO of Elerian AI, says that the company addresses the problem of semantic understanding by customising its speech recognition and Natural Language Understanding models per use case.
Dion states that this improves the agent's ability to identify the entities important for that use case, better understand customer intents, and interpret context throughout the interaction.
Accuracy in understanding is further enhanced by training the models on customer-specific historical recorded data.
All conversations in the Elerian solution are transcribed and are fully auditable in an analytics dashboard. Any errors are flagged, corrected and fed back into the system for retraining, continually improving the performance of the agent.
6. It’s easier to learn a second language than the first
Deepgram's pioneering work with transfer learning means its AI can learn new languages quickly. The first language is the hardest; once there's a model trained on one language, it's not a huge leap to train it on more.
If you start with English, then adding French can be done relatively quickly. This works because the two languages share a lot of similarities (for example, the same alphabet, similar grammar and a lot of common vocabulary), so much of the work involved in creating an English language model carries over to a French one.
From there, the model is improved with training data specific to French-speaking users.
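The sketch below shows the general shape of that idea in PyTorch: keep the encoder trained on the first language, swap the output layer for the new language's character set, and fine-tune on a smaller amount of new data. It's a toy model with made-up sizes and a simplified per-frame loss rather than a real ASR objective, and it is not Deepgram's actual architecture.

```python
# Toy transfer-learning sketch: reuse an "English" encoder, retrain only the
# output head for "French". Sizes, data and loss are simplified placeholders.
import torch
import torch.nn as nn

class TinyASRModel(nn.Module):
    """Stand-in for an end-to-end ASR model: acoustic encoder + output head."""
    def __init__(self, n_mels: int = 80, hidden: int = 256, vocab_size: int = 29):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)  # maps each frame to a character

    def forward(self, features):                   # features: (batch, time, n_mels)
        encoded, _ = self.encoder(features)
        return self.head(encoded)                  # (batch, time, vocab_size) logits

# 1. Pretend this model has already been trained on English (29 characters).
model = TinyASRModel(vocab_size=29)

# 2. Swap the head for a larger "French" character set (extra accented letters)
#    and freeze the encoder so only the new head learns from French audio.
FRENCH_VOCAB = 44                                  # hypothetical character count
model.head = nn.Linear(256, FRENCH_VOCAB)
for param in model.encoder.parameters():
    param.requires_grad = False

optimiser = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

# 3. Fine-tune on (much less) French data. One dummy batch shown here, using
#    per-frame cross-entropy instead of a real ASR loss such as CTC.
features = torch.randn(8, 200, 80)                 # 8 clips, 200 frames, 80 mel bins
targets = torch.randint(0, FRENCH_VOCAB, (8, 200)) # per-frame character labels (toy)
loss_fn = nn.CrossEntropyLoss()

logits = model(features)
loss = loss_fn(logits.reshape(-1, FRENCH_VOCAB), targets.reshape(-1))
loss.backward()
optimiser.step()
print(f"one fine-tuning step, loss = {loss.item():.3f}")
```

The intuition is that the earlier layers capture fairly language-general acoustic patterns, which is why they can be reused, while the layers closest to the output are more language-specific and need the most retraining.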
Moving forward, as more organisations roll out conversational AI solutions, the ability to converse in multiple languages will ensure that you're not only able to scale into new geographies but also able to improve the accessibility of your home-grown assistant.
7. You can’t fix what you don’t know about
Who's aware of a bot's problems, and who has the power to fix them? Does the conversation designer, data scientist or developer know what problems the bot is running into? If they don't know about the recurring problems, they can't adapt their design to overcome them.
You want to know what the users are trying to do at every stage of their conversation with your bot. That knowledge will allow you to tweak the design to help users rather than hinder them.
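A lightweight way to surface those problems is to log every turn with its transcription confidence and resolved intent, then flag low-confidence or fallback turns for human review. The sketch below is a generic illustration: the field names and threshold are assumptions, not any particular platform's schema.

```python
# Sketch of a review queue for problem turns. Field names and the threshold
# are illustrative; real platforms expose this through analytics dashboards.
from dataclasses import dataclass

@dataclass
class Turn:
    transcript: str
    asr_confidence: float   # 0.0 - 1.0, reported by the speech recogniser
    intent: str             # "fallback" means the NLU couldn't match anything

def needs_review(turn: Turn, confidence_threshold: float = 0.85) -> bool:
    return turn.asr_confidence < confidence_threshold or turn.intent == "fallback"

turns = [
    Turn("I want to renew my policy", 0.96, "renew_policy"),
    Turn("had toffee",                0.58, "fallback"),
    Turn("cancel my direct debit",    0.91, "cancel_payment"),
]

for turn in (t for t in turns if needs_review(t)):
    print(f"flag for review: {turn.transcript!r} "
          f"(conf={turn.asr_confidence}, intent={turn.intent})")
```

The flagged turns are exactly the ones worth correcting, relabelling and feeding back into retraining, closing the feedback loop described in step 5.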
Check out these two conversations with Benoit Alvarez, CEO, QBox, and Christoph Börner, CEO, Botium, on quality assurance, NLU management and testing AI agents for more information.
Conclusion
Bots need to understand humans if the two are going to communicate with each other and deliver user and business value. Automatic Speech Recognition and Natural Language Understanding are core elements of conversational AI that facilitate the bot's ability to understand human speech. How you choose, implement and tune these two technologies will have the single greatest impact on your ability to create voice assistants that actually understand people and never say "I'm sorry, can you repeat that?"
Thanks so much to Scott and Dion for sharing these insights. Check out the full interview with VUX World if you want to learn more.