7 steps to stop bots saying ‘can you repeat that?’

Ben McCulloch
May 27, 2022

When you release a digital assistant into the world, you hope it will never have to say the words “can you repeat that?”

Those four words signify a failure in the human/bot conversation. Perhaps the bot hadn’t been trained on the user’s accent. Perhaps the user’s request could be interpreted in many different ways, and the bot wasn’t sure how to progress the conversation. Perhaps the bot simply couldn’t handle the noisy environment where it was deployed, such as a restaurant.

Regardless of the reason, saying “can you repeat that” will result in the user losing confidence and the bot failing to meet user needs. This renders the whole thing pointless.

So, how can you avoid those disastrous four words? That’s exactly what Scott Stephenson, CEO of Deepgram, and Dion Millson, CEO of Elerian AI, spoke about with VUX World’s Kane Simms, sharing 7 steps to digital assistant success.

1. The importance of transcription

When you speak to a bot, Speech to Text (STT) is the technology that transcribes what you said. You’ll hear industry insiders talk about Speech to Text or Automatic Speech Recognition (ASR) but they’re actually the same thing – tech that transcribes the user’s spoken words.

A bad transcription simply means that the user has been misunderstood. For example, they said “add coffee” but the transcription was “had toffee”. From then onwards, the bot might try to take the conversation in a completely irrelevant direction. Worse still, it might say “I’m sorry, can you repeat that?”

So, the effectiveness of Speech to Text makes a huge difference in conversations with voicebots. Accurate transcription is the first element required to not only keep the conversation on track, but to begin the conversation in the first place.

Is it possible to get a perfect transcription?

The best transcribers are human and even they make mistakes. You can hope for around 99% accuracy at best. That’s good enough, and it’s become possible with advancements in deep learning.
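One common way to put a number on transcription accuracy is word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the transcription into what was actually said, divided by the number of words spoken. As a minimal sketch (the function name is ours, not from any particular STT toolkit), here is how the “add coffee” / “had toffee” example could be scored:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: a spoken word is missing
                curr[j - 1] + 1,     # insertion: an extra word was transcribed
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("add coffee", "had toffee")
print(f"WER: {wer:.0%}")
```

Both words were transcribed wrongly, so every word counts as an error. A transcription at around 99% accuracy would get roughly one word in a hundred wrong by this measure.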

2. The challenge of voice

When you send a text message to a friend, you don’t expect them to respond within milliseconds. Voice is different. We expect a vocal response to come much quicker than text. In America, the average pause length between turns is 0.74 seconds. Much longer and we assume something’s not right.

Users may think they’ve been misunderstood by the bot, or they may think the bot’s ignoring them. Regardless, it makes for a poor customer experience.

When you consider a human-bot conversation, a lot of processing has to happen on the bot’s side, including:

Collecting the user’s audio input (their words as well as other possible signs, such as emotion)
Transcribing the words said
Working out what those words mean within the context of that conversation
Deciding what the user wants to do
Giving the user what they need while generating a text-to-speech (TTS) response

And all of that in a little more time than it takes to blink.

That’s what we’re aiming for when we build ‘natural conversations’ digitally, and it’s no easy task.

Response time is a big challenge. The bot must react in a manner that feels like real time. So you need a tech stack whose components can process a lot of data quickly and work together to turn around conversational turns with human-like response times.
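To see where the time goes, it can help to measure each stage of a turn against the budget. The sketch below is only an illustration: the four stage functions are hypothetical stand-ins for real STT, NLU, dialogue and TTS services, and the budget is the 0.74-second average pause quoted above.

```python
import time
from typing import Any, Callable

TURN_BUDGET_SECONDS = 0.74  # average pause between turns quoted above


def run_turn(
    stages: list[tuple[str, Callable[[Any], Any]]], audio: Any
) -> tuple[Any, dict[str, float]]:
    """Run each stage in order and record how long each one takes."""
    timings: dict[str, float] = {}
    data = audio
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - start
    return data, timings


# Hypothetical placeholders: swap in calls to your own services.
stages = [
    ("transcribe", lambda audio: "add coffee"),
    ("understand", lambda text: {"intent": "add_item", "item": "coffee"}),
    ("decide", lambda meaning: f"Adding {meaning['item']} to your order."),
    ("synthesise", lambda reply: reply.encode("utf-8")),
]

_, timings = run_turn(stages, audio=b"raw audio bytes")
total = sum(timings.values())
for name, seconds in timings.items():
    print(f"{name:<12}{seconds * 1000:8.2f} ms")
status = "within" if total <= TURN_BUDGET_SECONDS else "over"
print(f"{'total':<12}{total * 1000:8.2f} ms ({status} budget)")
```

With real services in place of the placeholders, the per-stage timings show which component is eating the budget.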

3. Train your models on audio from your target use case

Let’s break down the sentence ‘train your models on audio from your target use case’. The data you use to train your ASR model should be specific to your use case. If you sell insurance, your users will use specific words and phrases when they call you. Those utterances may be different from the phrases you use internally within the company.

Who’s going to be talking to this bot?

If it’s for the company’s internal use, then go ahead and train it with the jargon you use in your company video calls. If it’s for a section of the general public, then you must use audio which represents how they talk about those things.

The audio you use must train your system on the:

Words and phrases used by your customers
Various accents that your customers may have
Physical environment your customers will speak to you in

For this, you need a speech recognition system that allows you to retrain your models for specific use cases.
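Before retraining, it is worth checking that your audio actually covers the accents and environments listed above. Here is a minimal sketch, assuming each recording has been tagged by hand; the manifest format, tag names and accents are all hypothetical.

```python
# Hypothetical targets: the accents and environments your customers have.
REQUIRED = {
    "accent": {"scottish", "welsh", "london"},
    "environment": {"quiet_home", "car", "restaurant"},
}

# Hypothetical manifest: one entry per labelled recording.
manifest = [
    {"file": "call_001.wav", "accent": "scottish", "environment": "quiet_home"},
    {"file": "call_002.wav", "accent": "london", "environment": "car"},
    {"file": "call_003.wav", "accent": "london", "environment": "restaurant"},
]


def coverage_gaps(recordings: list[dict], required: dict[str, set]) -> dict[str, set]:
    """Return the required values that no recording in the manifest covers."""
    gaps = {}
    for attribute, wanted in required.items():
        seen = {recording[attribute] for recording in recordings}
        missing = wanted - seen
        if missing:
            gaps[attribute] = missing
    return gaps


print(coverage_gaps(manifest, REQUIRED))
```

In this made-up manifest every environment is covered, but there are no Welsh speakers, so that is where more audio would need to be collected.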

4. Start with the best STT you can get and then improve it

According to Scott Stephenson, CEO of Deepgram, the new wave of speech recognition start-ups, such as Deepgram, are achieving recognition accuracy between 85% and 90% from the start, whereas legacy providers start at 65% to 75% accuracy. That’s a great head start, yet it can be improved further.

Once you’ve selected the best STT provider for your use case, you’ll need to adapt the model to your domain. Perhaps it’s missing vocabulary, or it doesn’t understand a specific accent from your locale, or your bot will be deployed in a noisy environment.

By focusing on training data for these unique needs, you will improve your bot and achieve ever-greater accuracy with speech recognition.

How to train speech recognition models for specific use cases

Scott says to start with 10 to 100 hours of labelled data that’s representative of what you’re trying to cover. Get a machine transcription first, then get humans to go through and edit the machine’s work. Then you can realistically achieve around 99% accuracy.

In-house ASR training

If it makes sense to put your own team together for this task, the benefit is that you can ensure the specific language within your domain gets covered. Outsourcing this work runs the risk that it’s performed by people who aren’t sensitive to your specific linguistic needs. In other words, you’ll be able to label your data better than anybody else.

Outsourcing ASR training

If outsourcing works better for you, it’s vital to establish a ‘style guide’ before work begins. Here, you’ll describe the various things users might say and what they mean. That will help the outsourced data labelling team to keep their work consistent and accurate.

Labelling data is a human-intensive effort. One hour of audio takes five or more hours to label, and more than ten hours if the results need to be extremely accurate.

How accurate does it need to be? That depends on your use case and the business case behind it.
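To put those labelling figures in context, the arithmetic can be sketched as follows. The function name is ours, and the multipliers are the lower bounds quoted above (five hours of labelling per hour of audio, ten when extreme accuracy is needed), applied to the 10 to 100 hours of audio Scott suggests starting with.

```python
def labelling_hours(audio_hours: float, hours_per_audio_hour: float) -> float:
    """Estimate human labelling effort for a given amount of audio."""
    return audio_hours * hours_per_audio_hour


for audio_hours in (10, 100):
    standard = labelling_hours(audio_hours, 5)
    high_accuracy = labelling_hours(audio_hours, 10)
    print(
        f"{audio_hours:>3} h of audio: at least {standard:.0f} h of labelling, "
        f"over {high_accuracy:.0f} h for very high accuracy"
    )
```

Even the smallest recommended dataset is more than a week of one person’s time, which is why the accuracy target should be tied to the business case.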

5. The importance of semantic understanding

STT may transcribe the user’s words perfectly, but what do the words mean? This is where NLU comes in, which deciphers the meaning behind the words the user said.

Here are three things to remember about semantic understanding:

Every user can ask for something in their own unique way, but your NLU must be able to understand all of them
The NLU must be able to disambiguate between similar wordings with different meanings, such as “crash” meaning a vehicle accident or a frozen computer
You need a strategy for continuous improvement. You will receive feedback that shows where you’re making consistent errors – how will you incorporate that feedback to improve the semantic understanding?
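The “crash” example can be illustrated with a toy disambiguator that picks a meaning from the other words in the utterance. This is only a keyword-overlap sketch with hypothetical word lists, not how any production NLU works; real systems learn these associations from training data.

```python
import re
from typing import Optional

# Hypothetical context vocabularies for the two senses of "crash".
SENSE_CONTEXT = {
    "vehicle_accident": {"car", "road", "driver", "insurance", "motorway", "vehicle"},
    "software_failure": {"computer", "laptop", "app", "screen", "frozen", "software"},
}


def disambiguate(utterance: str) -> Optional[str]:
    """Return the sense whose context words overlap most with the utterance."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    scores = {sense: len(words & context) for sense, context in SENSE_CONTEXT.items()}
    best = max(scores, key=scores.get)
    top = scores[best]
    # No overlap, or a tie, means the context is not enough to decide.
    if top == 0 or list(scores.values()).count(top) > 1:
        return None
    return best


print(disambiguate("I had a crash in my car on the motorway"))
print(disambiguate("My laptop had a crash and the screen is frozen"))
print(disambiguate("I had a crash"))
```

The last utterance carries no context at all, which is where conversation history or the use case itself (an insurance line versus an IT helpdesk) has to supply the meaning.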

Dion Millson, CEO of Elerian AI, says that the company addresses the problem of semantic understanding by customising its speech recognition and Natural Language Understanding models per use case.

Dion states that this improves the agent’s ability to identify the entities important for that use case, to understand customer intents better, and to interpret context throughout the interaction.

Accuracy in understanding is further enhanced by training the models on customer-specific historical recordings.

All conversations in the Elerian solution are transcribed and are fully auditable in an analytics dashboard. Any errors are flagged, corrected and fed back into the system for retraining, continually improving the performance of the agent.

6. It’s easier to learn a second language than the first

Deepgram’s pioneering work with transfer learning means AI can learn multiple languages quickly. The first language is the hardest. Once a model has been trained on one language, it’s not a huge leap to train it on more.

If you start with English, then adding French can be done relatively quickly. This works because the two languages share a lot of similarities (for example they have the same alphabet, similar grammar and a lot of common vocabulary), so much of the work involved in creating an English language model can easily be copied over to a French one.

From there, the model is improved with specific training data related to French-speaking users, for example.

As more organisations roll out conversational AI solutions, the ability to converse in multiple languages will not only let you scale into new geographies, but also improve the accessibility of your home-grown assistant.

7. You can’t fix what you don’t know about

Who’s aware of a bot’s problems and who has the power to fix them? Does the conversation designer, data scientist or developer know what problems the bot is running into? If they don’t know the recurrent problems, they’re not able to adapt their design to overcome those problems.

You want to know what the users are trying to do at every stage of their conversation with your bot. That knowledge will allow you to tweak the design to help users rather than hinder them.
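One simple way to surface recurrent problems is to count how often the bot falls back to “can you repeat that?” at each step of the conversation. The sketch below assumes a hypothetical log format, with one entry per turn recording the step and whether the bot had to fall back.

```python
from collections import Counter

# Hypothetical conversation log: (conversation step, did the bot fall back?).
turn_log = [
    ("greeting", False),
    ("collect_order", True),
    ("collect_order", False),
    ("collect_order", True),
    ("confirm_address", True),
    ("payment", False),
]

turns_per_step = Counter(step for step, _ in turn_log)
fallbacks_per_step = Counter(step for step, fell_back in turn_log if fell_back)

# Share of turns at each step that ended in a fallback.
fallback_rates = {
    step: fallbacks_per_step[step] / total
    for step, total in turns_per_step.items()
}

# List the worst-performing steps first.
for step, rate in sorted(fallback_rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{step:<16}{rate:.0%} of turns ended in a fallback")
```

A report like this gives the conversation designer, data scientist and developer a shared view of which steps need attention first.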

For more information, check out these two conversations with Benoit Alvarez, CEO of QBox, and Christoph Boner, CEO of Botium, on quality assurance, NLU management and testing AI agents.

Conclusion

Bots need to understand humans if the two are going to communicate and deliver user and business value. Automatic Speech Recognition and Natural Language Understanding are core elements of conversational AI, and together they give the bot its ability to understand human speech. How you choose, implement and tune these two technologies will have the single greatest impact on your ability to create voice assistants that actually understand people and never say “I’m sorry, can you repeat that?”

Thanks so much to Scott and Dion for sharing these insights. Check out the full interview with VUX World if you want to learn more.
