A code leak confirms a Sonos voice assistant is coming. Will this be the first voice assistant with true privacy at its heart?
Evidence of a Sonos voice assistant
A Reddit user found hidden code in the Sonos app that suggests a voice assistant, named SVC (likely short for Sonos Voice Control), is on its way.
One of the things the Sonos voice assistant is likely to differentiate itself on is privacy and security, given its underlying technology (which we’ll come to). That could make Sonos the first genuine contender to rival Google and Amazon in the smart speaker space.
The importance of privacy for voice assistant users
A recent study by Vixen Labs found that 52% of people in the US, UK and Germany are concerned about privacy and security when it comes to using voice assistants.
Even people who use their voice assistants every day are in some way concerned with privacy and security. It’s one of the hottest topics and biggest barriers to consumer adoption of voice user interfaces. That, and accuracy (“It never understands me”).
Building trust is just part of the natural technology adoption cycle
This is, in part at least, the nature of emerging and fledgling technology adoption. We all remember the days when we’d book a restaurant table online or buy something via an app, then call the company on the phone to make sure it’d all gone through! This apprehension and uncertainty around wider usage of voice assistants is the same.
Privacy and security are two of the main reasons some people still don’t bank online. My in-laws only started grocery shopping online because the pandemic made the perceived risk of shopping online less than physically entering a COVID-riddled shop.
So privacy and security concerns holding back a change in behaviour aren’t new. Why, then, is voice different?
Why privacy concerns exist with voice user interfaces
Well, firstly, it’s a totally different and new type of interface.
When you can’t see anything, it makes the whole interface difficult to grasp. Yes, it’s potentially easier to speak than it is to tap or swipe. And, yes, we’re used to speaking, so it should be easier to adopt, but that’s not strictly the case.
The historic confidence of screens
Just because you can talk to something, doesn’t mean that it’s easy to use. There are a whole bunch of things that a screen gives you that voice doesn’t.
You get a big green tick when something has been successful. You get a big red cross when something goes wrong. Even a page reloading can help users orient themselves in an experience.
Plus, people have been using screens for years and are now accustomed to the mental model and interface standards.
Buttons can be clicked. Boxes can be typed into. Some words can be clicked in some sentences. We know that, and so we understand it.
We also know what’ll happen when we perform certain actions. If we click a link, it’ll open another page. If we click a button, something will happen. We’ve performed those actions so often that we’re completely confident in the cause and effect of our interactions.
For new users, voice doesn’t always come naturally
With voice, all you have is sound. Sound is temporary and easy to miss. Was that a successful payment made or did it bounce? What did my assistant ask me again?
And if you do mishear or miss something, what do you do? What’s the verbal equivalent of reloading a page, scrolling back to the top or clicking the ‘back’ button?
Then there’s the lack of understanding about what’s actually possible to do in a given interaction. If I say ‘repeat’, will it actually repeat? Can I say anything to this thing or is it listening for specific ‘key words’ or ‘trigger phrases’? And what are those key words?
All this is to say that, for new users of voice interfaces, it doesn’t come naturally all the time. This leads to some uncertainty. And it’s hard to place your trust in something that’s so uncertain.
Uncertainty over an interface type decreases trust in that interface
Then, when a survey asks you about privacy and security, these are principally trust questions. Do you trust that your data is being held securely? Do you trust that your conversations are kept private?
If you’re uncertain about how a device or technology works, you inevitably lack confidence in using it. And without confidence, you won’t trust it. If you don’t trust it, you’ll naturally have concerns about whether your data is kept secure or private.
So part of the privacy and security issue is a lack of confidence and understanding of the interface itself.
Compounding the uncertainty
And it’s not just the interface type that’s uncertain for new users. It’s also the lack of understanding about the technology and what’s happening behind the scenes. Left to the inexperienced imagination, you can conjure up all kinds of possibilities. Some people genuinely believe that Jeff Bezos is sitting on his yacht listening to Alexa recordings.
And because you don’t know what’s happening behind the scenes, it’s easy to react poorly when you learn that contractors in the Philippines have been listening to some of the things you said to your phone; your most personal device.
Within the conversational AI, speech technology and NLP industry, it came as no surprise that folks were listening to recordings. It’s the only way you can train an ML system to improve: you review the ASR and NLU performance and make improvements, so that the next time someone asks ‘how old is Jeremy Kyle?’, the system can answer.
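To make that review loop concrete, here’s a minimal sketch of the standard metric reviewers use when grading ASR output against a human-corrected transcript: word error rate (WER). The phrases and function name are illustrative only, not any vendor’s actual tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# The model misheard one of five words: WER = 0.2
print(word_error_rate("how old is jeremy kyle", "how old is jeremy carl"))
```

Human reviewers supply the corrected reference transcripts; a falling WER on those corrections is how the team knows the retrained model has actually improved.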
Building trust will take time
So it’s clear that building trust will take time: as we collectively build new mental models for interacting with voice assistants and voice user interfaces, as the industry learns and eventually settles on interaction patterns, and as the public’s collective knowledge of the technology and how it works increases.
Short-term privacy solutions
One thing that can be done in the short term to help people build trust and confidence in using their voice as an interface is to make sure that speech data isn’t sent to the cloud for processing. That means none of your spoken audio travels over the internet, where it risks being intercepted. It means none of that audio is stored on a server somewhere for contractors to listen to. And it means you can be confident that everything you say to your device stays on your device.
Google and Apple have already made moves in this direction. Google announced at I/O ’20 that much of its speech recognition can run on-device on Pixel phones using its VoiceFilter-Lite system. Apple announced the same this summer, but restricted it to devices with the A12 Bionic chip running iOS 15.
This means that you can speak and dictate to your phone, or use the voice assistant, and none of that data is sent to the cloud for processing.
So it’s happening, but it’s all very new to Google and Apple, and there’s no sign of it from Amazon, the market-leading smart speaker provider.
Making on-device speech recognition and natural language processing work
Enter Snips. The French startup built speech recognition and natural language understanding designed to run entirely on-device. Its ‘edge’ computing model meant that you could, for example, ask your coffee machine to make you a coffee, and that request would be fulfilled without an internet connection.
Snips was acquired in November 2019 for $37.5m by none other than Sonos.
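The edge pattern is easy to sketch. In the toy example below, the whole wake-word-to-action pipeline runs locally, so the coffee request never needs a network connection. The intents and phrases are made up for illustration; a real Snips-style system uses trained on-device ASR and NLU models, not string matching.

```python
# Hypothetical on-device intent table -- in a real system this would be
# a trained NLU model shipped with the device firmware.
LOCAL_INTENTS = {
    "make me a coffee": "BREW_COFFEE",
    "make me an espresso": "BREW_ESPRESSO",
}

def handle_utterance(transcript: str) -> str:
    """Map a locally transcribed utterance to a device action."""
    intent = LOCAL_INTENTS.get(transcript.lower().strip())
    if intent is None:
        # Out-of-domain requests are simply refused, locally.
        return "Sorry, I can only make coffee."
    # Fulfilment is local too: drive the machine, no cloud round-trip.
    return f"OK, executing {intent} on-device."

print(handle_utterance("Make me a coffee"))
```

The point of the design is that the audio, the transcript and the decision all stay on the appliance; there is simply nothing to intercept or store server-side.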
Sonos voice assistant and privacy by design
So we now know that Sonos is working on a voice assistant, we know the code exists in its app, and we know that the underlying technology is Snips: privacy-first by design.
Therefore, if there’s one bet you could place, it’s that the Sonos voice assistant will run the vast majority of its processing on-device.
This would make it a viable alternative to Amazon’s Echo and Google’s Home smart speakers, placing the Sonos smart speaker in direct competition with them.
Why Sonos can seriously compete with Amazon, Google and Apple
Music has consistently been the top use case for smart speakers since the launch of the Amazon Echo seven years ago. And we know that privacy is a top concern for users. A smart speaker that sounds as good as a Sonos (a well-trusted and respected audio brand), with privacy and security at its absolute centre thanks to on-device processing, is a no-brainer for some.
Smaller language models mean complete on-device capabilities
While Amazon and Google will get to entirely on-device processing eventually, it’ll take them a lot longer. Their speech and language models are extremely broad. Alexa needs to transcribe all kinds of requests and classify them accurately: everything from timers to music to calendars to news to skills to recipes to general questions. Alexa, Google and Apple often have to process these in the cloud because that’s where the computing power is.
Sonos only needs to understand music requests. That’s not a small task (Google found there are over 5,000 ways to say “set an alarm”), but it’s a much easier task than understanding everything.
Sonos is in a strong position
That means that Sonos, with its music-specific language model, will only need to send specific song titles, artist names, playlist names and so on to the cloud in order to play the right track. And it’ll only need an internet connection for the music streaming itself, and to receive language model updates for things like new song and album titles, artist names and playlists.
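A narrow domain like this can be sketched as a handful of on-device patterns that extract just the slots needing a cloud catalogue lookup. Everything else stays local. This is purely illustrative: a real Snips/Sonos NLU stack is a trained model, not a regex, and the pattern below only covers one phrasing.

```python
import re
from typing import Optional

# Hypothetical music-only grammar: match "play <track> [by <artist>]".
PLAY_PATTERN = re.compile(
    r"^play (?P<track>.+?)(?: by (?P<artist>.+))?$", re.IGNORECASE
)

def parse_music_request(utterance: str) -> Optional[dict]:
    """Extract only the slots that need a cloud catalogue lookup."""
    match = PLAY_PATTERN.match(utterance.strip())
    if not match:
        # Out of domain: a music-only assistant can simply refuse, locally.
        return None
    return {"track": match.group("track"), "artist": match.group("artist")}

print(parse_music_request("play Hey Jude by The Beatles"))
```

Only the extracted entity text (not the audio, not the full utterance) would ever leave the device, and only to resolve it against the streaming catalogue.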
Up until now, privacy-focused voice processing has been reserved for specific devices from the largest companies in the world. Now there’s a true contender. And it could help increase confidence and build trust in voice user interfaces in general, which would be good for all of us.