Why we need a no-code multimodal orchestration tool

Ben McCulloch
January 19, 2023
in Article, Opinion


During my chat with Jason F. Gilbert, he expressed frustration at how hard it is to orchestrate robotic movements with voice, sound effects, lights and other modes of expression.

This matters, and not just for Jason, who is the Lead Character Designer at Intuition Robotics, where he works on ElliQ – ‘the sidekick for healthier, happier aging.’

There are many more designers working in robotics. The same challenge will face conversation designers who work on digital humans as well – they also use various modalities to express themselves.

How can we design characters that use all their modes of expression to convey meaning? That’s what humans do. Words are only one component of how we communicate – we use much more to express ourselves.

No tools for the job

Here’s what Jason said (edited for brevity):

“There’s not a single platform for designing multimodality experiences on robots. I’ve talked a lot about this. I’ve asked a lot of people about this – people from different robotics companies. No one has this. Ideally, we would have some kind of Voiceflow, or some kind of no-code tool, where you can just go ‘okay, this is where the lights come in’. And now when [ElliQ] says this line, she also has this sound effect, this gesture and this thing, and that would be amazing. But you don’t have that [tool] right now.”

That’s a huge challenge for designers like Jason. Conversation designers should be able to orchestrate every facet of a bot’s expressive repertoire. That’s how we communicate! We don’t use words with a little garnishing of something else. Sometimes words are the garnishing in what someone means.

To design a bot’s expressive language, we have to consider not only what’s natural for a human to understand, but also what possibilities for expression a bot has that humans don’t (such as lights, or the buzzes, whirs and clicks a robot’s motors make as they move).

Where are the solutions?

It’s also a huge opportunity for anyone who wants to design such a tool. It would need to allow designers to orchestrate everything a bot can do together, so that we can combine different modalities that let bots express themselves in various ways.
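To make the idea concrete, here is a minimal sketch of the kind of data a tool like this might produce behind its no-code interface: a single "beat" in which speech, lights, gesture and sound are all placed on one shared timeline. Everything in it is an assumption made for illustration. The `Cue` and `Beat` names, the modality labels and the action names are hypothetical and do not come from ElliQ, Voiceflow or any real robotics platform.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Cue:
    """One expressive action on one modality (all names are hypothetical)."""
    modality: str    # e.g. "speech", "gesture", "light", "sound"
    action: str      # what the bot does on that modality
    start: float     # seconds from the start of the beat
    duration: float  # how long the action lasts, in seconds


@dataclass
class Beat:
    """A single moment of expression: several cues sharing one timeline."""
    name: str
    cues: list = field(default_factory=list)

    def add(self, modality, action, start=0.0, duration=1.0):
        # Returning self lets a designer chain cues for one line of dialogue.
        self.cues.append(Cue(modality, action, start, duration))
        return self

    def schedule(self):
        # Order cues by start time so a player could trigger them in sequence.
        return sorted(self.cues, key=lambda c: (c.start, c.modality))

    def active_at(self, t):
        # Everything the bot is expressing at time t, across all modalities.
        return [c for c in self.cues if c.start <= t < c.start + c.duration]


greeting = (
    Beat("greeting")
    .add("speech", "Good morning!", start=0.0, duration=1.5)
    .add("light", "warm_pulse", start=0.0, duration=2.0)
    .add("gesture", "lean_forward", start=0.2, duration=1.0)
    .add("sound", "chime", start=1.5, duration=0.5)
)

for cue in greeting.schedule():
    print(f"{cue.start:>4.1f}s  {cue.modality:<8} {cue.action}")
```

The point of the sketch is that the line of dialogue is just one cue among several. A designer would never see this code. They would drag the lights, the chime and the gesture onto the same timeline as the words.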

Our industry needs this. Bots and digital humans will be incredibly dull conversation partners if their facial expressions and body language express nothing. Without such a tool, the other modalities could easily become window dressing that doesn’t express what the bot means.

It could be worse if their various modalities contradict each other. Imagine a user asks whether it’s time to take their daily medicine, and the bot nods its head (rather than shaking it) while saying “no”. That would confuse the user, and it could be dangerous.
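An orchestration tool could catch that kind of mismatch before it ever reaches a user. The sketch below is one possible approach, not an established method: it assumes each cue can be hand-labelled as affirming or negating, and flags any pair that disagree. The `POLARITY` table and the cue names are hypothetical.

```python
# Hypothetical labels: +1 means the cue affirms, -1 means it negates.
POLARITY = {
    ("speech", "yes"): 1,
    ("speech", "no"): -1,
    ("gesture", "nod"): 1,
    ("gesture", "head_shake"): -1,
}


def find_contradictions(cues):
    """Return pairs of simultaneous cues whose meanings disagree.

    `cues` is a list of (modality, action) tuples. Cues with no polarity
    label (lights, ambient sounds and so on) are ignored.
    """
    labelled = [(cue, POLARITY[cue]) for cue in cues if cue in POLARITY]
    conflicts = []
    for i, (first, first_sign) in enumerate(labelled):
        for second, second_sign in labelled[i + 1:]:
            if first_sign != second_sign:
                conflicts.append((first, second))
    return conflicts


# The medicine example: the bot says "no" while nodding.
medicine_reply = [("speech", "no"), ("gesture", "nod"), ("light", "soft_blue")]
print(find_contradictions(medicine_reply))
```

A real tool would need a much richer model of meaning than a yes/no label, but even a check this crude would flag the nodding "no".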

How do other industries do it?

It’s so easy to fall into the trap of thinking every challenge in conversational AI is brand new. That’s not always the case.

Animated films and videogames have workflows where every facet of a character is considered and brought to life, with varying results.

Check out this video on the making of Rango. Whereas actors will usually just be brought in to replace temporary voices for animated films, on Rango the actors were filmed together on a motion-capture stage acting out every scene. Those performances were the source materials for the animation, so each actor’s body movements, facial expressions and voice were captured before being applied to their CGI character. As you can see, it brings the characters to life and they’re so expressive!

Compare Rango to Fireman Sam. Both are CGI, but watching Fireman Sam is like watching wooden marionettes. Their body language is often redundant. Their facial expressions say very little. The characters’ potential for expression with their bodies and faces hasn’t been exploited at all.

Why can’t we make bots and digital humans that are as expressive as a character from Rango?

What’s suitable for our industry?

Of course, most of us aren’t making entertainment products. Our bots often have important roles to play. They have to empathise. They have to build trust. They have to sell things. They have to represent a brand and its values.

You could say in those cases the stakes are higher than with entertainment. When someone watches a film they don’t like, they can ask for a refund or moan on social media, and then the story is over. On the other hand, if someone has an underwhelming conversation with an AI assistant, then they might never talk to it again or stop dealing with that brand.

For a companion bot such as ElliQ, trust is paramount. The user and bot communicate with each other, and a relationship grows from those exchanges.

So, where’s the tool to help conversation designers orchestrate a bot’s multimodal expressions? We’re not animators and we don’t have mo-cap studios or actors. We shouldn’t have to learn every trick a CGI animator knows to do this, and yet it’s our job to create excellent communicators.

Someone get on it! We need this.

Here’s my full interview with Jason – he gives many great insights.

You can also check out Kane’s interviews with Stefan Scherer and Danny Tomsett for more on designing for robots and digital humans.

This article was written by Benjamin McCulloch. Ben is a freelance conversation designer and an expert in audio production. He has a decade of experience crafting natural-sounding dialogue: recording, editing and directing voice talent in the studio. Some of his work includes dialogue editing for Philips’ ‘Breathless Choir’ series of commercials, a Cannes Pharma Grand-Prix winner; leading teams in localizing voices for Fortune 100 clients like Microsoft; and sound design and music composition for video games and film.

