Voice interfaces on personal computing devices aren't new. The late 2010s were the golden years of the home voice assistant (Alexa, Google Home, Apple HomePod), the early 2010s saw the rebirth of Siri, and Google Assistant followed mid-decade. While two-way conversation feels like the most natural way to model the interaction (you speak to someone and you listen for a response), there are a few reasons it shouldn't be the dominant way people interact with their devices.
The first is time: it takes time to formulate a prompt and then deliver it vocally in a way that is unambiguously clear and error-free. Make one mistake and you'll have to start the sentence all over again. Time also comes into play when the assistant sends the prompt to its servers, processes the query, and then speaks the response back to the user.
Voice inputs are also not designed for the act of browsing, only for single, laser-focused queries: a specific action to perform or a specific piece of information to retrieve. In face-to-face conversation, personal anecdotes and details only come through many layers of conversational digging, which requires much lower latency and a certain mental preparedness (you need to be in the mood for it).
Voice assistants, and even just the concept of having an assistant, are fraught with the IRL analog of patriarchal corporate structure. Questions of digital etiquette have been raised before: does the way we speak to our inanimate, (hopefully) unfeeling devices make us ruder when dealing with the real people who are in the act of assisting us? USA Today and TechRadar have both run op-eds on this phenomenon, and personally I think it's a dynamic we should try our best not to emulate.
All that emotion in your voice is wasted on speech recognition models, which just want the words and not the feeling.
The last reason, I believe, is the embarrassment that comes with using a voice assistant, especially in public. Often when I'm on a bike barrelling down a hill and need to engage with Siri, I will try my best to utter the trigger phrase under my breath, lest a passing pedestrian hear me. When it inevitably doesn't trigger by the third or fourth attempt I simply give up and resolve to check manually at the next red light. Perhaps this is more of a personal problem, but being caught being ignored by my phone is objectively tragic, and so is talking at an audible volume unprompted and out of context.
This is not a voice-controlled user interface. The input controller will be jogwheel/slider-based with tactile response. Voice and sound will be the media that go in and out; they are not to be tampered with or used to control elements of the interface. Listening will be the primary mode of engagement.
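As a rough sketch of what that input vocabulary might look like: the controller emits only mechanical events, and no branch of the interface ever parses speech as a command. Every name here (`InputEvent`, `onInput`, `scrollStations`, and so on) is hypothetical, not taken from any actual implementation.

```typescript
// Hypothetical input vocabulary for a tactile, non-voice controller.
// The device emits only mechanical events; audio is never parsed for commands.
type InputEvent =
  | { kind: "jog"; detents: number }    // +n clockwise, -n counter-clockwise
  | { kind: "slide"; position: number } // absolute slider position, 0..1
  | { kind: "press" };                  // tactile select

function scrollStations(detents: number): void {
  console.log(`browse ${detents > 0 ? "forward" : "back"} by ${Math.abs(detents)}`);
}

function setVolume(position: number): void {
  console.log(`volume ${Math.round(position * 100)}%`);
}

function commitToStation(): void {
  console.log("tuned in; now just listening");
}

// Route every event to an action. Note there is no "speak" branch:
// listening stays the primary mode of engagement.
function onInput(event: InputEvent): void {
  switch (event.kind) {
    case "jog":
      scrollStations(event.detents);
      break;
    case "slide":
      setVolume(event.position);
      break;
    case "press":
      commitToStation();
      break;
  }
}

onInput({ kind: "jog", detents: 2 }); // browse forward by two stations
```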
Before virtual assistants, the most common way of interacting with voice/sound-based automated systems was DTMF (Dual-Tone Multi-Frequency) signalling. Although it was ultimately replaced by voice prompt recognition, DTMF was the natural evolution of using pure audio channels to transmit information and enable useful interaction over the phone. Relying on phone lines does mean there are only two channels through which information can move (audio coming in, audio going out). Moreover, these two channels depend on time, not only to stay synchronised but to convey information at all. Auto-attendant menus are linear by nature: to pick the last option, every option before it must first be read out verbally.
On one hand this ensures the user knows all the options available; on the other it makes the system less responsive and more frustrating to use. Only power users who remember extensions truly benefit from this setup, which raises the question of why they couldn't just dial the full extension from the get-go.
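To make the mechanism concrete: each key on a touch-tone pad is encoded as two sine tones played at once, one from a low-frequency group (the keypad rows) and one from a high-frequency group (the columns), which is where the "dual-tone" comes from. The sketch below generates those pairs with the browser's Web Audio API; it's a minimal illustration of the signalling scheme, not production dialling code.

```typescript
// DTMF assigns each key a (row, column) frequency pair; the "dual tone"
// is both sines summed. These are the standard touch-tone frequencies.
const DTMF_PAIRS: Record<string, [number, number]> = {
  "1": [697, 1209], "2": [697, 1336], "3": [697, 1477],
  "4": [770, 1209], "5": [770, 1336], "6": [770, 1477],
  "7": [852, 1209], "8": [852, 1336], "9": [852, 1477],
  "*": [941, 1209], "0": [941, 1336], "#": [941, 1477],
};

// Play one key's tone pair for `durationMs` (a browser AudioContext is assumed).
function playDtmfKey(ctx: AudioContext, key: string, durationMs = 120): void {
  const pair = DTMF_PAIRS[key];
  if (!pair) throw new Error(`not a DTMF key: ${key}`);
  for (const freq of pair) {
    const osc = ctx.createOscillator();
    osc.type = "sine";
    osc.frequency.value = freq;
    const gain = ctx.createGain();
    gain.gain.value = 0.2; // keep the two summed tones below clipping
    osc.connect(gain).connect(ctx.destination);
    osc.start();
    osc.stop(ctx.currentTime + durationMs / 1000);
  }
}

// Usage, in a browser after a user gesture:
// playDtmfKey(new AudioContext(), "5");
```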
Of course, before phones had buttons or rotary dials or voice recognition, operators were the way people engaged with tele-voice. Despite the lack of peripherals, speaking to an operator was (and still is) the preferred way to get connected to the right people. Operators were respected people on the other end of the line, and speaking with a real human being is what feels most natural to most of us. While the scale of demand and cost-cutting have rendered the operator obsolete, they do teach us one thing: speaking works best when engaging directly with a listener, and vice versa.
Listening to the radio never feels like an incomplete experience, despite it lacking the supposedly desirable on-demand qualities. If streaming services only played songs from halfway through, I think I would riot. What radio's model does allow, though, is swift channel surfing. Ideally you want to land in the middle of a song, so that you know exactly what the vibe of the playlist is. Often it is the song playing, rather than the station code, that decides whether I commit to a radio station. For me, coming in halfway through, walking into something already in motion, is a more honest impression of what the station is all about.