
A framework for consistently measuring the usability of voice and conversational user interfaces


Answering one of the questions we get asked repeatedly: how to measure the usability of conversational user interfaces, like chatbots and voice bots, and their impact on customer experience.


We’ve found that Nielsen Norman Group’s general usability framework applies just as well to measuring the usability of voice and conversational interfaces and applications.

So here’s an overview of the framework and how you apply it to conversational AI. Next time you’re working on a chat bot or voice bot application, consider this throughout your design, development and usability testing phases.

Learnability

Learnability is concerned with how quickly a user can understand and use the application and, subsequently, how that contributes to how fast they can accomplish tasks.

You might think that, with conversational applications, learnability is built-in. Most people know how to talk or type and so what’s to learn, right?

Well, not quite. Learnability with conversational user interfaces can be measured by looking at things like:

1. Access

How easy is it to access your application? Does it require much effort? Is it intuitive?

Consider things like wake word detection: how easy is it to learn? Is the launch icon on your web page or app obvious and easily recognisable?

TRY THIS: Present your solution in-situ during usability testing. If it’s a chatbot, put it on a webpage behind a chat icon. If it’s an Alexa skill, put it on a smart speaker for testing. This will allow you to understand how easy your application is for people to access.

2. Welcome message

Is your application clear about what it can and can’t do? Does it set the user’s expectations at the outset?

TRY THIS: Observe how users respond to the welcome message and measure the % of initial utterances that are ‘out of scope’. You can also ask users during user research interviews what their expectation of the application is after hearing the welcome message or description. This’ll help you understand whether you’re allowing people to get off on the right foot.
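If your platform logs which intent matched each user’s first utterance, that percentage is easy to pull out of the data. Here’s a minimal sketch in Python, assuming a simple list of logged first turns and an ‘out_of_scope’ label (both invented for illustration, not any particular vendor’s schema):

```python
# Minimal sketch: estimate the share of first utterances that fall outside
# the application's scope. The log structure and labels are assumptions.
first_turns = [
    {"utterance": "what's my account balance", "intent": "check_balance"},
    {"utterance": "tell me a joke", "intent": "out_of_scope"},
    {"utterance": "book a table for two", "intent": "book_table"},
    {"utterance": "are you a real person", "intent": "out_of_scope"},
]

out_of_scope = sum(1 for turn in first_turns if turn["intent"] == "out_of_scope")
rate = out_of_scope / len(first_turns) * 100
print(f"Out-of-scope first utterances: {rate:.1f}%")  # 50.0% on this toy sample
```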

3. Prompts and responses

Is your conversational AI clear at every turn about what’s expected of the user? Do your prompts leave little room for interpretation, and do they consistently solicit the kind of response you’d expect from the user?

TRY THIS: During testing, look out for any friction or confusion caused off the back of your prompts as users progress through the conversation. Make a note of the prompts that solicit unexpected responses or that cause friction. Ask users after their interaction to elaborate on what they interpreted the prompt to mean and why they answered the way they did.

This will help you understand how to craft a better prompt, as well as give you insight into additional directions the conversation might need to take.

4. UI elements

Do users make use of additional UI options? If you’re using screen displays for voice applications, do they help or hinder? If you’re using buttons and carousels in your chat bot, are they easy to navigate? Are they too constraining? Are users aware they can type or speak responses as well as click options?

TRY THIS: Allow users to test your application in the most natural and realistic situation for them. If they use smart displays at home, test on those. If they prefer chat, use that. Then, observe how users interact with the application without being prompted to determine how easy to learn your UI elements are. If you have a user that doesn’t use them at all, try asking them to perform another task, but this time limit their available input to just clicking on buttons (or whatever it is you’re testing).

This will enable you to observe both whether UI elements are naturally easy to learn and use and, if pushed, whether they can do the job intended on their own.

Don’t forget real users

It’s easy for conversational AI practitioners and conversation designers to assume that everyone knows how to use voice assistants and chatbots. The reality is that the average person doesn’t have the same level of awareness or interest. Some people will need a little more hand-holding than others.

Some conversational user interfaces offer tutorials at the outset that coach users through using the application. This happens more in the voice assistant world than in the chatbot world, but can be handy for folks who aren’t familiar or comfortable using conversational applications.

Efficiency

Whereas Learnability is related to how quickly a user can learn to use an application, Efficiency of use is related to how quickly users can complete tasks using your conversational UI.

Things to consider and look out for in usability testing include:

Architecture

Does your conversational user interface have a flat architecture, meaning that users can begin by expressing any of their needs and initiating any intent? Or do you have a tree-based architecture where you force users down specific paths?
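To make the distinction concrete, here’s a hypothetical sketch of the two approaches. The intent names and the crude keyword matcher are invented for illustration (a real system would use an NLU model), but the routing difference is the point: a flat architecture lets the first utterance trigger any intent, while a tree forces the user through a fixed menu.

```python
# Hypothetical illustration of flat vs. tree-based routing.
# Intent names and the crude keyword matcher are invented for the example.
INTENTS = {
    "book_table": ["book", "table", "reservation"],
    "opening_hours": ["open", "hours", "close"],
    "cancel_booking": ["cancel"],
}

def flat_route(utterance: str) -> str:
    """Flat architecture: the user's first utterance can trigger any intent."""
    words = utterance.lower().split()
    for intent, keywords in INTENTS.items():
        if any(keyword in words for keyword in keywords):
            return intent
    return "fallback"

def tree_route(menu_choice: str) -> str:
    """Tree-based architecture: the user must pick from a fixed menu first."""
    menu = {"1": "book_table", "2": "opening_hours", "3": "cancel_booking"}
    return menu.get(menu_choice, "repeat_menu")

print(flat_route("Can I book a table for Friday?"))  # book_table
print(tree_route("2"))                               # opening_hours
```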

TRY THIS: During testing, instruct users to speak in their own words and try to get through the interaction as efficiently as they can. Rather than doing as the bot asks them, tell them to use their initiative and act as though they were having a normal conversation with a human.

Prepare for things to go wrong here because they likely will. The main thing you’ll learn is how people speak when they’re trying to get things done, rather than following bot instructions.

Direct invocations

Do you allow users to use direct invocations and cut to the chase?

With voice applications, this includes being able to launch the application with a specific request, such as “Alexa, ask [My skill] for [Request]”, or initiating the interaction with a specific command, such as “Hey [Car assistant], turn on the heating and set the temperature to 18˚C.”

TRY THIS: When testing, give users specific tasks to complete and ask them to think of the quickest way to get started.

This will help them break out of the habit of following instructions and give you an insight into the language they use to describe what they’re trying to do, which is great material for design and training data.

Fluidity

Does your application allow the user to take the conversation in the direction they want? Are you preempting the ways a conversation will move and moving along with it? Or are you constantly re-prompting until you get the response you need?

For example, if your restaurant booking chatbot asks:

“What time would you like a table for?”

And the user responds with:

“What time do you start serving?”

Are you able to handle that request and move back to your initial prompt?
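Here’s one hedged sketch of that behaviour: answer the recognised side question, then return to the open prompt. The FAQ lookup and prompt wording are illustrative only, not a specific framework’s API.

```python
# Illustrative sketch: answer a recognised side question, then re-ask the
# open slot-filling prompt. FAQ content and prompt wording are invented.
FAQS = {
    "what time do you start serving": "We start serving at 5pm.",
}

def handle_turn(user_utterance: str, open_prompt: str) -> str:
    """Answer a known side question and return to the open prompt."""
    answer = FAQS.get(user_utterance.lower().rstrip("?"))
    if answer:
        return f"{answer} So, {open_prompt[0].lower()}{open_prompt[1:]}"
    return f"Sorry, I didn't catch that. {open_prompt}"

prompt = "What time would you like a table for?"
print(handle_turn("What time do you start serving?", prompt))
# -> We start serving at 5pm. So, what time would you like a table for?
```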

TRY THIS: During testing, ask your users to try and ‘trip’ the bot up by asking related questions and answering prompts in ways that are still related to the use case, but that they wouldn’t expect the bot to be able to handle. For example, ask them to think of questions that they’d have for the bot in response to prompts it serves.

This’ll allow you to understand the various ways a conversation could go and enable you to plan for it.

In live applications, monitor your ‘no match’ utterances to identify questions and anomalies that could be signs of users taking the conversation in another direction.
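A minimal sketch of that kind of monitoring, assuming you can export your ‘no match’ utterances from your logs. Counting duplicates is often enough to surface recurring themes:

```python
# Minimal sketch: surface the most common 'no match' utterances from your
# logs so recurring themes stand out. The log format is an assumption.
from collections import Counter

no_match_log = [
    "do you do vegan options",
    "can i bring my dog",
    "do you do vegan options",
    "is there parking nearby",
]

for utterance, count in Counter(no_match_log).most_common(3):
    print(f"{count:>3}  {utterance}")
```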

In general, to measure the efficiency of an application, keep an eye out for anything that adds additional steps in the journey or things that seem to prevent the user from accomplishing their task.

Memorability

Memorability is concerned with how easily a user can resume or reuse an interaction without learning it again, as opposed to how well someone remembers the feelings brought about by the experience.

This is particularly difficult for voice applications, where you don’t have a visual cue to remind users of what an application is capable of. It’s even harder for voice applications built on top of platforms like Amazon Alexa and Google Assistant, where users have to remember an invocation phrase on top of the wake word and the task they’re trying to accomplish.

With voice, you’re fighting against the frailty of our natural short term memory.

Things to measure here include:

Wake word and invocation phrase memory

Whether you’re testing recall of a wake word for custom assistants or the invocation phrase of an Alexa skill or Google Assistant action, the thing we’re trying to learn is the same: can people remember and get started again quickly?

TRY THIS: if you were to ask a user research participant at the very end of your session ‘how would you access this voice assistant/application again if you needed to?’, what would they say? How they answer this question will tell you whether you’ve designed something that has memorability.

Use case and functionality memory

Though the welcome message needs to set the scene and communicate the scope of your application, once a user has been through your conversational interface, can they get started again without being specifically prompted?

TRY THIS: Change the Welcome message for returning users and monitor the % of users who are able to get started through remembering how they got started last time.
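A minimal sketch of that idea, assuming you can tell returning users apart and tag whether a session got started without help (the fields and copy here are invented for illustration):

```python
# Illustrative sketch: a shorter welcome for returning users, plus a simple
# measure of how many of them get started unaided. Fields are invented.
def welcome(returning_user: bool) -> str:
    if returning_user:
        return "Welcome back. What can I do for you?"
    return ("Hi, I can help you book a table, check opening hours "
            "or cancel a booking. What would you like to do?")

# Share of returning users whose first utterance matched an intent
# without a help message or re-prompt being needed.
returning_sessions = [
    {"started_unaided": True},
    {"started_unaided": False},
    {"started_unaided": True},
]
unaided = sum(session["started_unaided"] for session in returning_sessions)
print(welcome(returning_user=True))
print(f"{unaided / len(returning_sessions):.0%} got started from memory")
```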

Building on prior experience

Do users behave differently when they return? Do they speak more succinctly as they try to get through your options quicker? Do they experiment more with other phrases as they build confidence?

This is something to consider for high-touch use cases where repeat usage is likely. For example, restaurant bookings, delivery checking etc. As the user gets more confident, the bot should adjust in line with that.

Not all use cases require returning users. Some industries have a low number of interactions with customers and so how much time and effort is spent measuring Memorability will differ.

Still, it’s best practice to make things smooth sailing for returning users, and measuring how easy it is for those users to access and interact with your conversational user interface a second time is a great way to do that.

Error rate

The error rate of the interface is concerned with how many user errors are made during the experience.

This isn’t about backend errors like bugs or timeouts; those need fixing as a matter of course. Instead, we’re concerned here with any ‘errors’ the user makes during the usage of our application that are incongruous with how we anticipated usage when designing it.

I put ‘errors’ in quotation marks because I’m of the belief that there is no such thing as user error. User errors are indicators of design flaws and thus opportunities to learn how to improve the usability of our applications.

To monitor error rates, the main thing we’re concerned with is how often we get a ‘no match’ on our interaction model, i.e. how many times a user says something we haven’t anticipated or catered for. How many times does a user hear a fallback or have to be re-prompted?

The two main errors you’ll encounter are:

  1. No input reprompt: if a user hears your no input reprompt, it means that they didn’t respond to your request within a given timeframe. When this happens, we need to ask ourselves: why does the user need so much time to think? How are we phrasing the initial prompt? Are we too vague? Are we asking too much? Is this a question that’s too complex for conversational interfaces?
  2. No match reprompt: if a user hears your no match reprompt, it means that they’ve responded with something that doesn’t match what you expected and specified in your language model. Here, we need to observe the type of no match. Is it something that should be in the language model and has been overlooked? Is it a new utterance that you can use as training data? Or is it something else unexpected?
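To keep an eye on both of these in a live application, a simple per-prompt tally is often enough to show where the friction is. This is a minimal sketch; the event structure is an assumption, not any specific platform’s analytics format.

```python
# Minimal sketch: tally no-input and no-match reprompts per prompt so the
# worst offenders stand out. Event names and structure are assumptions.
from collections import defaultdict

reprompt_events = [
    {"prompt": "party_size", "type": "no_match"},
    {"prompt": "party_size", "type": "no_match"},
    {"prompt": "booking_time", "type": "no_input"},
    {"prompt": "party_size", "type": "no_input"},
]

tally = defaultdict(lambda: {"no_input": 0, "no_match": 0})
for event in reprompt_events:
    tally[event["prompt"]][event["type"]] += 1

for prompt, counts in sorted(tally.items(), key=lambda item: -sum(item[1].values())):
    print(f"{prompt}: {counts['no_input']} no-input, {counts['no_match']} no-match")
```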

One of the instructions that often comes from developers when they release software for testing is ‘try and break it’, meaning that if you try to break it and you can’t, it’s ready.

This is the same philosophy we should apply to the design of our conversational user interfaces. In particular, during testing, searching for the errors listed above is a great way of building out a robust design and language model. And this learning shouldn’t stop once you’re live. It should be something you’re constantly monitoring and iterating on.

Satisfaction

User satisfaction is fairly self-explanatory. How satisfied are users with the interaction and experience?

Many companies use NPS (Net Promoter Score) to measure how likely a customer is to recommend the brand to a friend after experiencing the conversational user interface. This question can be included in your conversational applications and calculated on the backend. Or it can be gathered through another channel, such as SMS, after the user interaction is over.
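If you do collect the 0–10 NPS question in the conversation itself, the backend calculation is straightforward: promoters (9–10) minus detractors (0–6), as a percentage of all responses. A minimal sketch over an invented list of scores:

```python
# Minimal sketch: calculate NPS from 0-10 scores gathered after the
# conversation. Promoters score 9-10, detractors 0-6.
scores = [10, 9, 7, 6, 10, 3, 8, 9]

promoters = sum(1 for score in scores if score >= 9)
detractors = sum(1 for score in scores if score <= 6)
nps = (promoters - detractors) / len(scores) * 100
print(f"NPS: {nps:.0f}")  # 4 promoters, 2 detractors over 8 responses = 25
```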

Other metrics you can use to measure customer satisfaction include:

  • Percentage of completed conversations: the completion rate of your bot is the percentage of users who made it through to the end of an interaction. This can be monitored with tagging and monitoring users who trigger an intent that you specify as a ‘final’ interaction.
  • Percentage of live agent transfers: if your customers are being escalated to a live agent during a conversation that should be automated from end-to-end, that spells a problem and could be an indicator of a poor experience.
  • Specific customer feedback: rather than NPS, you can use other metrics that matter to you, such as asking explicitly ‘how can we improve this experience?’ or ‘rate this experience out of 10’ and other quantitative and qualitative questions that get insight into the areas your company deems important related to satisfaction.
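The first two of those metrics fall out of tagged conversation outcomes. Here’s a minimal sketch, assuming each conversation is labelled with an outcome when it ends (the labels are invented for illustration):

```python
# Illustrative sketch: completion rate and live-agent transfer rate from
# tagged conversation outcomes. The outcome labels are assumptions.
conversations = [
    {"outcome": "completed"},
    {"outcome": "agent_transfer"},
    {"outcome": "completed"},
    {"outcome": "abandoned"},
]

total = len(conversations)
completed = sum(1 for c in conversations if c["outcome"] == "completed")
transferred = sum(1 for c in conversations if c["outcome"] == "agent_transfer")

print(f"Completion rate:     {completed / total:.0%}")    # 50%
print(f"Agent transfer rate: {transferred / total:.0%}")  # 25%
```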

TRY THIS: In your applications, ask: ‘Did this answer your question?’ This is a great way of understanding the performance of your chat or voice bot. It also allows you to ask a follow-up question if people say ‘no’, such as ‘How come? What answer were you looking for?’

Now you’re generating real customer feedback on how to improve your AI that you can feed into a separate customer insights dashboard.

Don’t expect everyone to answer that question, but that’s OK. All feedback is good feedback.

Putting it into practice

Nine times out of ten, your customer is just trying to get a job done and get on with their life. The more we can get out of the way and enable that to happen, the better our applications will be. This framework gives you some foundational considerations you can apply to do just that.

If you have any questions about the framework, or would like more information on how we can help you implement conversation design best practice in your organisation, feel free to reach out, hit us up on the live chat or subscribe to the VUX World newsletter below:

Are you over-polishing your chatbot?


When it comes to designing chatbots and voicebots, don’t over-polish your dialogue.

Conversation design best practice with Salesforce’s Greg Bennett


Learn conversation design best practice with Greg Bennett, Conversation Design Principal at Salesforce.

The difference between chatbot and voice search refinements


What’s the difference between how people use chatbots and search bars vs voice user interfaces and what does that mean for how you design interactions for each?

One of the big, striking differences between designing for a voice user interface versus a chat user interface, and between how people use chat and text-based interfaces (including search boxes) compared to voice, is all to do with search refinements.

If you use natural language search on a retailer website and you search for something like “I’m looking for men’s summertime clothes”, “I’m looking for something to wear this summer” or “I’m looking for something to wear on my holiday”, and you don’t find anything off the back of that search, your refinement will end up shortening your search phrase and making it more keyword-based: “men’s summer clothes”. You’ll refine it down to something shorter, because we’ve been trained over decades in how to use search engines and how search engines work.

If I have an actual conversation, if I’m in a shop talking to a sales assistant and I say “I’m looking for some clothes” and they say “what do you mean?”, what I’m likely to do in that situation is refine my search, refine my phraseology.

But if I’m having a conversation in person, the refinement is likely to be a hell of a lot longer. So instead of me just saying “men’s summer clothes”, I’m likely to say something like: “Well, I’m going on holiday in a couple of weeks’ time, you know, it’s supposed to be really hot weather. I’m looking for some shorts and t-shirts, that kind of stuff.”

So the utterance there is incredibly long because I’m adding a whole load more context to the discussion. I’m saying that we’re going on holiday. There’s some context. I’m saying it’s going to be hot weather. That’s inferred that I’m looking for hot summertime clothing. I give examples by saying shorts and t-shirts and I don’t need to say ‘mens’ because it’s implied by the subtext of the conversation given the person who’s actually having the conversation.

And so not only is there additional information underneath the utterance, but there’s also a hell of a lot more information in the utterance itself.

We’ve been trained over the years, lifetimes, of having conversations that if someone doesn’t understand you, you then elaborate so that you can add more context, more information, to help them understand.

In the voice context, if you’re using a shopping application or a shopping voice user interface and it asks you a question like “Do you want to know more about the red t-shirts or the blue t-shirts?”, you might just say “Both”.

The utterance starts out being narrow and short, but if the system doesn’t understand you and it says, “I’m sorry, I didn’t understand that. Do you want red or blue?”, you over-elaborate again, because you’ve been trained in conversation to add more information so that the other person can understand you.

And so instead of saying “both” again, you’ll say “I need both the red and the blue” or “I want to know more about both the red and the blue”, and your utterance becomes longer.

And so one of the real things to pay attention to when you’re designing voice user interfaces is:

1) be clear about the way you phrase the question and anticipate those kinds of nuanced responses
2) be prepared, when you do have to repair a conversation, for the fact that the utterances you get in response might be a little bit longer and contain a little bit more information.

Of course, it does work the other way around too. Sometimes people will start with a long search phrase, realise the system isn’t quite functioning properly and doesn’t understand them, and therefore refine it to something a little bit shorter. It isn’t always one way round; sometimes it’s the inverse.

Conversational ear worms


What is the conversational equivalent of an ear worm?

An ear worm is a song that you just cannot get out of your head. It doesn’t matter how hard you try, it just sticks in there.

If any of you have got kids then you’ll know exactly what it’s like to wake up at five o’clock in the morning, busting for the loo and you just cannot get that Peppa Pig song out of your head!

Musicians and music writers all over the world strive to create ear worms because if you can create an ear worm, then that’s job done!

My latest ear worm, and I don’t see any reason why you should be immune to this, is Thomas the Tank Engine.

So I was thinking about that and I was thinking what’s the conversational equivalent of an ear worm?

We’ve all had conversations that we remember, some of us have had conversations that might have even been life-changing.

Does the same logic tie into conversations that we have with our voice assistants?

I remember the first time I asked Google Assistant for a football score and it played the sound of a crowd cheering in the background. I still remember that today. It’s one of the best interactions I’ve had on Google Assistant.

And so we have the tools to create memorable experiences through a combination of conversation design and sound design and it doesn’t matter whether you’re a boring old insurance company or whether you’re a cutting-edge media outfit.

We all have access to the same tools and we all have the potential to create memorable and meaningful conversations.

So what’s the most memorable conversation you’ve had with your voice assistant, or the most memorable conversation you’ve had at all, and why?

Think conversation design is complex? You ain’t seen nothing yet


If you think conversation design is complex, you ain’t seen nothing yet.

Multi modal design with Google’s Daniel Padgett


Google’s Head of Conversation Design, Daniel Padgett, shares how his team approach multi modal design across all Google Assistant-enabled devices.

SMS conversation design with Hillary and Matthew Black


Hillary and Matthew Black join us to share how to design and implement automated conversational AI with SMS messaging, and why you should.

Ethical conversation design with Microsoft’s Deborah Harrison


What is ethical conversation design? Why is it important? And what can we do to design conversations more responsibly? Join Deborah Harrison, Cortana’s first writer, to find out.

Conversation design and grounding strategies with Jon Bloom


Jon Bloom, Senior Conversation Designer at Google, joins us to share what a conversation designer does at Google, as well as some conversation design techniques used at Google, such as ‘grounding strategies’.