Despite the revolutionary potential of voice UI, the user experience simply isn’t very good, argues Rebecca Sentance.

It’s been a little over six years since Apple’s iconic digital voice assistant, Siri, was unveiled with the launch of the iPhone 4S in October 2011.

Since that time, the buzz around voice commands and interfaces has been steadily increasing – particularly in the search industry (which I primarily write about for a living), where voice search is billed as one of the major trends that will shape the industry in the near future.

But how many of you reading this actually use voice commands to accomplish tasks in daily life?

If you ask Google's voice search, "When was Siri launched?", your friendly voice assistant will helpfully read you the following snippet from Wikipedia:

Siri’s original release on iPhone 4S in 2011 received mixed reviews. It received praise for its voice recognition and contextual knowledge of user information, including calendar appointments, but was criticized for requiring stiff user commands and having a lack of flexibility.

This depiction – stiff user commands, accompanied by comical misinterpretation of the requests made of them – has become the prevailing stereotype associated with voice assistants like Siri, memorably parodied in this clip from CollegeHumor’s hilarious series ‘If Google Was a Guy’.

“Siri, how big is the Serengeti?”

“No problem. Show me pictures of spaghetti.”

This and other comic mishaps from the world of voice interfaces speak to what is probably the biggest obstacle standing in the way of their mass adoption. Despite the revolutionary potential of voice UI, the user experience simply isn’t very good.

To be fair, voice assistants have come on in leaps and bounds in terms of voice recognition accuracy over the past few years, with Google's and Microsoft's speech recognition systems nearing the coveted 95% benchmark: the level of speech recognition accuracy achieved by humans.

But voice interfaces have a number of other limitations, besides accuracy, which make the experience of using them less than optimal.

A conversational ‘uncanny valley’

Voice interfaces are designed to mimic human conversation, and so when we converse with a voice assistant, we enter a very different headspace to when we’re using a more obviously ‘inhuman’ interface on a computer screen or a keyboard.

The more conversational these interactions become, the more we expect digital voice assistants to act like humans, and the more jarring it is when they inevitably don’t – a sort of conversational uncanny valley.

Decades spent interacting with computers in a specific way have taught us what they respond to and what works; we aren't going to attempt a keyboard shortcut that simply doesn't exist.

But bring that interaction into the very human realm of conversation, and suddenly a whole new set of expectations comes into play. We want our voice assistants to think, respond and reason like humans – and the user experience predictably falls short.

Either, over time, we'll learn to modify our expectations of voice UI to something more realistic, or voice interfaces will evolve to reach the extremely high bar that has been set for them (and then we'll never need to bother interacting with another human being again).

Lack of choice

Closely linked with this is the issue that while voice interfaces give the illusion of human interaction, using one is in reality more like talking to a glorified phone tree, or at best an intelligent chatbot.

There is very limited flexibility to deviate from the preset script; users need to utter exactly the right words and commands to have their request recognised, making the experience painfully unintuitive.
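To see why the "preset script" problem makes voice feel unintuitive, consider a deliberately simplified, hypothetical sketch (not any real assistant's code): an assistant that only recognises exact, pre-registered phrases will handle the scripted wording perfectly and fail on any natural paraphrase of the same request.

```python
# Hypothetical sketch of a preset-script voice assistant: it matches only
# exact, pre-registered phrases, so natural paraphrases fall through.

RESPONSES = {
    "set a timer for ten minutes": "Timer set for 10 minutes.",
    "what is the weather today": "It's 18 degrees and cloudy.",
}

def respond(utterance: str) -> str:
    # Normalise trivially (case, whitespace, trailing punctuation),
    # then require an exact match against the script.
    key = utterance.lower().strip().rstrip("?.!")
    return RESPONSES.get(key, "Sorry, I didn't understand that.")

# The scripted phrase works...
print(respond("Set a timer for ten minutes"))
# ...but a natural paraphrase of the very same request does not.
print(respond("Could you time ten minutes for me?"))
```

Real assistants use far more flexible intent matching than this, but the user-facing effect is similar whenever an utterance falls outside the phrases the developers anticipated.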

Voice assistants don’t always retain conversational context like humans do, either (Image: /r/SiriFail)

Voice interactions tend to follow a very linear sequence, funnelling users down a default, pre-determined path: voice search typically surfaces the first result only, making purchases with Amazon’s Alexa means that your goods will be sourced only from Prime-eligible items, and so on.
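The contrast with a screen can be sketched in a few hypothetical lines: a visual interface can present the whole ranked list for the user to scan and choose from, while a voice interface typically collapses it to a single spoken answer.

```python
# Hypothetical illustration of the linear "funnel" of voice interaction.
ranked_results = ["Result A", "Result B", "Result C"]

def screen_response(results: list[str]) -> list[str]:
    # A screen can show everything; the user chooses.
    return results

def voice_response(results: list[str]) -> str:
    # A voice interface reads out one pre-determined path:
    # typically just the top result.
    return results[0]

print(screen_response(ranked_results))
print(voice_response(ranked_results))
```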

We’ve explored previously on this blog why simplicity doesn’t necessarily guarantee usability, and why a lack of available functionality can often be more frustrating than useful. The minute that users can’t easily accomplish what they need to with an interface, they’ll be turned off it, no matter how efficiently it allows them to re-order paper towels.

Winning user trust

This leads us on to our final issue with voice UI: winning and keeping user trust.

In the right circumstances, voice is a very convenient tool, and it’s gaining more and more capabilities all the time. But it only takes one or two bad experiences for users to be soured on the idea of voice interfaces.

As Greg Hart, Amazon’s vice president in charge of Echo and Alexa, told Slate, building a voice assistant that can respond to every possible query is “a really hard problem”.

"People can get really turned off if they have an experience that's subpar or frustrating."

The article explains that Amazon tackled this problem by setting manageable expectations for the Amazon Echo’s capabilities: the Echo is shipped with a ‘cheat sheet’ of simple queries that the device can respond to. Perhaps this is where, in 2011, Apple went wrong: by implicitly inviting users to ask Siri anything, it set her up for a fall.

We know now that voice commands work best with very specific constraints in place, and adjusting our expectations for voice interfaces – along with improvements in the technology – has gone some distance towards winning back user trust.

Even so, voice has a number of UX barriers still to overcome before we’re likely to see it adopted on a truly widespread scale.

For an in-depth and entertaining guide to getting started with user research, read our free-to-download, comprehensive ebook ‘User Experience Research 101’


Main image by Andres Urena