A framework for consistently measuring the usability of voice and conversational user interfaces

By Kane Simms

Answering one of the questions we get asked repeatedly: how to measure the usability of conversational user interfaces, like chat bots and voice bots, and their impact on customer experience.


We’ve found that Nielsen Norman Group’s general usability framework applies just as well to measuring the usability of voice and conversational interfaces and applications.

So here’s an overview of the framework and how you apply it to conversational AI. Next time you’re working on a chat bot or voice bot application, consider this throughout your design, development and usability testing phases.


Learnability

Learnability is concerned with how quickly a user can understand and use the application and, subsequently, how that contributes to how fast they can accomplish tasks.

You might think that, with conversational applications, learnability is built-in. Most people know how to talk or type and so what’s to learn, right?

Well, not quite. Learnability with conversational user interfaces can be measured by looking at things like:

1. Access

How easy is it to access your application? Does it require much effort? Is it intuitive?

Consider things like wake word detection. How easy is it to learn? The launch icon on your web page or app, is it obvious and easily recognisable?

TRY THIS: Present your solution in-situ during usability testing. If it’s a chatbot, put it on a webpage behind a chat icon. If it’s an Alexa skill, put it on a smart speaker for testing. This will allow you to understand how easy your application is for people to access.

2. Welcome message

Is your application clear about what it can and can’t do? Does it set the user expectations at the outset?

TRY THIS: Observe how users respond to the welcome message and measure the % of initial utterances that are ‘out of scope’. You can also ask users during user research interviews what their expectation of the application is after hearing the welcome message or description. This’ll help you understand whether you’re allowing people to get off on the right foot.
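
To make the ‘out of scope’ percentage concrete, here is a minimal sketch of how it could be calculated from session logs. The log format, the intent names and the `IN_SCOPE_INTENTS` set are all assumptions for illustration, not part of any particular platform.

```python
from typing import Optional

# Hypothetical set of intents the application supports.
IN_SCOPE_INTENTS = {"book_table", "check_opening_hours", "cancel_booking"}


def out_of_scope_rate(first_intents: list[Optional[str]]) -> float:
    """Percentage of first utterances that did not map to a supported intent.

    Each item is the intent matched for a user's first utterance after the
    welcome message; None means nothing matched at all.
    """
    if not first_intents:
        return 0.0
    out_of_scope = sum(
        1 for intent in first_intents if intent not in IN_SCOPE_INTENTS
    )
    return 100.0 * out_of_scope / len(first_intents)


# Assumed sample data: one matched intent (or None) per session.
first_intents = ["book_table", None, "order_pizza", "check_opening_hours"]
print(f"{out_of_scope_rate(first_intents):.1f}% out of scope")  # 50.0% out of scope
```

Tracking this number before and after changing the welcome message gives a rough signal of whether the new wording sets expectations better.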

3. Prompts and responses

Is your conversational AI clear at every turn what’s expected of the user’s response? Do your prompts leave little room for interpretation and do they consistently solicit the kind of response you’d expect from the user?

TRY THIS: During testing, look out for any friction or confusion caused by your prompts as users progress through the conversation. Make a note of the prompts that solicit unexpected responses or that cause friction. After the interaction, ask users to elaborate on what they interpreted the prompt to mean and why they answered in the way they did.

This will help you understand how to craft a better prompt, as well as give you insight into additional directions the conversation might need to take.

4. UI elements

Do users make use of additional UI options? If you’re using screen displays for voice applications, do they help or hinder? If you’re using buttons and carousels in your chat bot, are they easy to navigate? Are they too constraining? Are users aware they can type or speak responses as well as click options?

TRY THIS: Allow users to test your application in the most natural and realistic situation for them. If they use smart displays at home, test on those. If they prefer chat, use that. Then, observe how users interact with the application without being prompted to determine how easy to learn your UI elements are. If you have a user that doesn’t use them at all, try asking them to perform another task, but this time limit their available input to just clicking on buttons (or whatever it is you’re testing).

This will enable you to observe both whether UI elements are naturally easy to learn and use and whether, if pushed, they can do the intended job on their own.

Don’t forget real users

It’s easy for conversational AI practitioners and conversation designers to assume that everyone knows how to use voice assistants and chatbots. The reality is that the average person doesn’t have the same level of awareness or interest. Some people will need a little more hand-holding than others.

Some conversational user interfaces offer tutorials at the outset that coach users through using the application. This happens more in the voice assistant world than in the chatbot world, but can be handy for folks who aren’t familiar or comfortable using conversational applications.


Efficiency of use

Whereas Learnability is related to how quickly a user can learn to use an application, Efficiency of use is related to how quickly users can complete tasks using your conversational UI.

Things to consider and look out for in usability testing include:


Does your conversational user interface have a flat architecture, meaning that users can begin by expressing any of their needs and initiating any intent? Or do you have a tree-based architecture where you force users down specific paths?

TRY THIS: During testing, instruct users to speak in their own words and try to get through the interaction as efficiently as they can. Rather than doing as the bot asks them, tell them to use their initiative and act as though they were having a normal conversation with a human.

Prepare for things to go wrong here because they likely will. The main thing you’ll learn is how people speak when they’re trying to get things done, rather than following bot instructions.

Direct invocations

Do you allow users to use direct invocations and cut to the chase?

With voice applications, this includes being able to launch the application with a specific request, such as “Alexa, ask [My skill] for [Request]”, or initiating the interaction with a specific command, such as “Hey [Car assistant], turn on the heating and set the temperature to 18˚C.”

TRY THIS: When testing, give users specific tasks to complete and ask them to think of the quickest way to get started.

This will help them break out of the concept of following instructions and give you an insight into the language they use to describe what they’re trying to do, which is great design and training data.


Does your application allow the user to take the conversation in the direction they want? Are you preempting the ways a conversation will move and moving along with it? Or are you constantly re-prompting until you get the response you need?

For example, if your restaurant booking chatbot asks:

“What time would you like a table for?”

And the user responds with:

“What time do you start serving?”

Are you able to handle that request and move back to your initial prompt?

TRY THIS: During testing, ask your users to try and ‘trip’ the bot up by asking related questions and answering prompts in ways that are still related to the use case, but that they wouldn’t expect the bot to be able to handle. For example, ask them to think of questions that they’d have for the bot in response to prompts it serves.

This’ll allow you to understand the various ways a conversation could go and enable you to plan for it.

In live applications, monitor your ‘no match’ utterances to identify questions and anomalies that could be signs of users taking the conversation in another direction.
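
As a rough illustration of monitoring ‘no match’ utterances, the sketch below counts the most frequent ones in a log so that recurring questions stand out. The log structure and the `event` and `utterance` field names are assumptions; your platform’s export will look different.

```python
from collections import Counter


def top_no_match_utterances(log, limit=3):
    """Return the most frequent utterances that triggered a no match.

    Utterances are lower-cased and trimmed so trivial variations are
    counted together.
    """
    normalised = (
        entry["utterance"].strip().lower()
        for entry in log
        if entry["event"] == "no_match"
    )
    return Counter(normalised).most_common(limit)


# Assumed sample log from a restaurant booking bot.
log = [
    {"event": "no_match", "utterance": "What time do you start serving?"},
    {"event": "match", "utterance": "7pm please"},
    {"event": "no_match", "utterance": "what time do you start serving?"},
    {"event": "no_match", "utterance": "Do you have parking?"},
]

for utterance, count in top_no_match_utterances(log):
    print(count, utterance)
```

In this sample, the opening hours question surfaces at the top of the list, which suggests a related intent the bot should handle before returning to its original prompt.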

In general, to measure the efficiency of an application, keep an eye out for anything that adds additional steps in the journey or things that seem to prevent the user from accomplishing their task.


Memorability

Memorability is concerned with how easily a user can resume or reuse an interaction without learning it again, as opposed to how well someone remembers the feelings brought about by the experience.

This is particularly difficult for voice applications where you don’t have a visual cue to remind users of what an application is capable of. And even harder for voice applications built on top of platforms like Amazon Alexa and Google Assistant, where users have to remember an invocation phrase on top of the wake word and task they’re trying to accomplish.

With voice, you’re fighting against the frailty of our natural short-term memory.

Things to measure here include:

Wake word and invocation phrase memory

Whether you’re testing recall of a wake word for custom assistants or the invocation phrase of an Alexa skill or Google Assistant action, the thing we’re trying to learn is the same: can people remember and get started again quickly?

TRY THIS: if you were to ask a user research participant at the very end of your session ‘how would you access this voice assistant/application again if you needed to?’, what would they say? How they answer this question will tell you whether you’ve designed something that has memorability.

Use case and functionality memory

Though the welcome message needs to set the scene and communicate the scope of your application, once a user has been through your conversational interface, can they get started without being specifically prompted?

TRY THIS: Change the Welcome message for returning users and monitor the % of users who are able to get started through remembering how they got started last time.

Building on prior experience

Do users behave differently when they return? Do they speak more succinctly as they try to get through your options quicker? Do they experiment more with other phrases as they build confidence?

This is something to consider for high-touch use cases where repeat usage is likely. For example, restaurant bookings, delivery checking etc. As the user gets more confident, the bot should adjust in line with that.

Not all use cases require returning users. Some industries have a low number of interactions with customers and so how much time and effort is spent measuring Memorability will differ.

That said, it’s best practice to make things smooth sailing for returning users, and measuring how easy it is for those users to access and interact with your conversational user interface a second time is a great way to do that.

Error rate

The error rate of the interface is concerned with how many user errors are made during the experience.

This doesn’t mean backend errors like bugs or timeouts. Those all need fixing as a matter of course. Instead, we’re concerned here with any ‘errors’ the user makes while using our application that are incongruous with how we anticipated usage when designing it.

I put ‘errors’ in quotation marks because I believe there is no such thing as user error. User errors are indicators of design flaws and thus opportunities to learn how to improve the usability of our applications.

To monitor error rates, the main thing we’re concerned with is how often we get a no match on our interaction model, i.e. how many times does a user say something that we haven’t anticipated or catered for? How many times does a user hear a fallback or have to be re-prompted?

The two main errors you’ll encounter are:

  1. No input reprompt: if a user hears your no input reprompt, it means that they didn’t respond to your request within a given timeframe. When this happens, we need to ask ourselves: why does the user need so much time to think? How are we phrasing the initial prompt? Are we too vague? Are we asking too much? Is this a question that’s too complex for conversational interfaces?
  2. No match reprompt: if a user hears your no match reprompt, it means that they’ve responded with something that doesn’t match what you expected and specified in your language model. Here, we need to observe the type of no match. Is it something that should be in the language model and has been overlooked? Is it a new utterance that you can use as training data? Or is it something else unexpected?
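
One way to find the prompts that cause the most trouble is to break both reprompt types down per prompt. The following is a minimal sketch, assuming each logged turn is a hypothetical `(prompt_id, outcome)` pair where the outcome is `match`, `no_input` or `no_match`.

```python
from collections import defaultdict


def reprompt_rates(turns):
    """Map each prompt ID to its no-input and no-match rates (0.0 to 1.0)."""
    counts = defaultdict(lambda: {"total": 0, "no_input": 0, "no_match": 0})
    for prompt_id, outcome in turns:
        counts[prompt_id]["total"] += 1
        if outcome in ("no_input", "no_match"):
            counts[prompt_id][outcome] += 1
    return {
        prompt_id: {
            "no_input_rate": c["no_input"] / c["total"],
            "no_match_rate": c["no_match"] / c["total"],
        }
        for prompt_id, c in counts.items()
    }


# Assumed sample data: one entry per time a prompt was served.
turns = [
    ("ask_time", "match"),
    ("ask_time", "no_match"),
    ("ask_time", "no_input"),
    ("ask_time", "match"),
    ("ask_party_size", "match"),
]

rates = reprompt_rates(turns)
for prompt_id, r in rates.items():
    print(prompt_id, r)
```

A prompt with a high no-input rate is a candidate for simpler wording, while a high no-match rate points to gaps in the language model.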

One of the instructions that often comes from developers when they release software for testing is ‘try and break it’, meaning that if you try to break it and you can’t, it’s ready.

This is the same philosophy we should apply to the design of our conversational user interfaces. In particular, during testing, searching for the errors listed above is a great way of building out a robust design and language model. And this learning shouldn’t stop once you’re live. It should be something you’re constantly monitoring and iterating on.


User satisfaction

User satisfaction is fairly self-explanatory. How satisfied are users with the interaction and experience?

Many companies use NPS (Net Promoter Score) to measure how likely a customer is to recommend the brand to a friend after experiencing the conversational user interface. This question can be included in your conversational applications and calculated on the backend. Or it can be gathered through another channel, such as SMS, after the user interaction is over.
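
For reference, the standard NPS calculation on the backend is small: respondents scoring 9 or 10 count as promoters, those scoring 0 to 6 count as detractors, and the score is the percentage of promoters minus the percentage of detractors. The sketch below assumes the answers have already been collected as integers from 0 to 10; the function name and sample data are illustrative.

```python
def net_promoter_score(scores):
    """Return NPS on a -100 to 100 scale from a list of 0-10 ratings."""
    if not scores:
        raise ValueError("at least one score is required")
    promoters = sum(1 for s in scores if s >= 9)   # ratings of 9 or 10
    detractors = sum(1 for s in scores if s <= 6)  # ratings of 0 to 6
    return 100.0 * (promoters - detractors) / len(scores)


# Assumed sample answers to "how likely are you to recommend us?"
scores = [10, 9, 8, 7, 6, 3, 10, 9]
print(net_promoter_score(scores))  # 25.0
```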

Other metrics you can use to measure customer satisfaction include:

  • Percentage of completed conversations: the completion rate of your bot is the percentage of users who made it through to the end of an interaction. This can be monitored by tagging and tracking users who trigger an intent that you specify as a ‘final’ interaction.
  • Percentage of live agent transfers: if your customers are being escalated to a live agent during a conversation that should be automated from end-to-end, that spells a problem and could be an indicator of a poor experience.
  • Specific customer feedback: rather than NPS, you can use other metrics that matter to you, such as asking explicitly ‘how can we improve this experience?’ or ‘rate this experience out of 10’ and other quantitative and qualitative questions that get insight into the areas your company deems important related to satisfaction.
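
The first two metrics above can be sketched as follows, assuming each conversation is logged as a list of the intents it triggered. The `FINAL_INTENTS` set and the `live_agent_transfer` intent name are hypothetical; substitute whatever tags your own application uses.

```python
# Hypothetical tags: intents that mark the end of a successful interaction,
# and the intent that hands the user over to a live agent.
FINAL_INTENTS = {"booking_confirmed"}
TRANSFER_INTENT = "live_agent_transfer"


def conversation_metrics(conversations):
    """Return completion and live agent transfer rates as percentages."""
    total = len(conversations)
    if total == 0:
        return {"completion_rate": 0.0, "transfer_rate": 0.0}
    completed = sum(
        1 for intents in conversations if FINAL_INTENTS & set(intents)
    )
    transferred = sum(
        1 for intents in conversations if TRANSFER_INTENT in intents
    )
    return {
        "completion_rate": 100.0 * completed / total,
        "transfer_rate": 100.0 * transferred / total,
    }


# Assumed sample data: the intents triggered in each conversation.
conversations = [
    ["book_table", "booking_confirmed"],
    ["book_table", "live_agent_transfer"],
    ["check_opening_hours"],
    ["book_table", "booking_confirmed"],
]
print(conversation_metrics(conversations))
```

Note that a conversation such as the opening hours query may be a success without reaching a ‘final’ intent, so which intents count as final is a design decision in its own right.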

TRY THIS: In your applications, ask: ‘did this answer your question?’ This is a great way of understanding the performance of your chat or voice bot. It also allows you to ask a follow-up question if people say ‘no’, such as ‘How come? What answer were you looking for?’

Now you’re generating real customer feedback on how to improve your AI that you can feed into a separate customer insights dashboard.

Don’t expect everyone to answer that question, but that’s OK. All feedback is good feedback.

Putting it into practice

Nine times out of ten, your customer is just trying to get a job done and get on with their life. The more we can get out of the way and enable that to happen, the better our applications will be. This framework offers some foundational considerations you can apply to do just that.

If you have any questions about the framework, or would like more information on how we can help you implement conversation design best practice in your organisation, feel free to reach out, hit us up on the live chat or subscribe to the VUX World newsletter.
