2026-06-29 · 6 min

The Voice as the Last Stronghold of the Human?

A philosophical look at voice assistants, politeness, and what happens when machines listen to us.

inhuma · acoustics · voice-assistants

Typing is distance. When I write a message, I can reconsider it, delete it, reformulate it. The letters are discrete, cool, controllable. The voice is the opposite: it trembles, hesitates, rises with excitement or drops with shame. It carries timbre, breath, pause — everything we cannot control.

inhuma's blog asks the question: who adapts to whom? No area is more sensitive to this than acoustics. The spoken voice is the most direct of all communication channels — and therefore the most vulnerable. As we increasingly speak not with humans but with machines, we have to ask: what do these conversations do to us?

The invisible conditioning

Anyone who has spoken with a voice assistant knows the pattern: short sentences work. Clear commands too. Courtesy phrases? Ignored. "Alexa, turn on the light" — full stop. No "please," no "thank you," no "could you perhaps?"

The problem lies not in the machine but in the feedback loop. We adapt. Not consciously, but through a thousand small successes and failures: a long sentence isn't understood? Next time, shorter. An ironic phrase? The machine takes it literally. So we drop irony.

Voice-based chatbots reward a particular register: brief, free of filler words, free of ambiguity. These are not the qualities that characterise a good human conversation; if anything, they are the opposite. Human communication thrives on what is left open, on the negotiation of meaning, on reading between the lines. Are machines training us to unlearn this capacity?

inhuma's QWERTY example shows that we have been adapting to the requirements of a machine for over 150 years without noticing it anymore. The same is happening with speech — only faster, and with a far more fundamental channel.

The politeness trap

A curious observation: many people say "please" and "thank you" to their voice assistants — even though they know perfectly well that no one is listening. Is that a hollow habit? Or is there something more?

The sociologist and psychologist Sherry Turkle (MIT) has described, in her studies of human-machine interaction, how we tend to "animate" machines as soon as they interact with us verbally. A chatbot that uses our name triggers a minimal but measurable emotional response. The voice amplifies this effect dramatically: it sounds friendly, concerned, helpful, even when we know there is no inner experience behind it.

This creates a philosophical problem. Is it harmless to show politeness to a machine? Or does an ethics of human-machine communication begin here, one we have not yet found words for? If politeness is a social bond between people — what becomes of that bond when we practise it daily with something absent?

The question sharpens as synthetic voices grow more human. An "I understand you" from a smart-home speaker is not real understanding. But it sounds like it. And the ear is not good at distinguishing what it hears from what it feels.

The voice as authenticity guarantee

Synthetic voices are already barely distinguishable from real ones. ElevenLabs, OpenAI Voice Engine, Google's SoundStorm — the technology is here and improving.

This has a paradoxical consequence: the human voice becomes scarce. In a world where any "I'm sorry" might be synthetic, the real voice gains value. We may soon pay more attention to whether a voice belongs to a person or a model. And we will need to learn a new social distinction: between what sounds like empathy, and what actually is.

The philosopher Harry Frankfurt once distinguished "bullshit" (speech that is indifferent to truth) from lying (which at least has to track the truth in order to conceal it). Perhaps we need a comparable category for synthetic empathy: the communication of emotion that has no emotion, but simulates it perfectly.

inhuma's acoustics tile gains unexpected depth here. The question is no longer just whether machines understand us. The question is whether we can still tell if they mean anything by what they say to us.

Who is actually speaking here?

The voice is the last stronghold of the human because it is what we can least control. A written sentence can be deleted. A spoken one remains in the room — and with the listener.

When we give machines access to this channel, we give them more than data: we let them participate in a place that was until now reserved for humans. The task is not to optimise the channel (faster, clearer, more efficient). The task is to stay watchful about what we entrust to it.

inhuma asks: who adapts to whom? The answer in acoustics is uncomfortable. It is not the machines learning to speak more humanly. It is us learning to think more machine-like. And that is a price about which we have not yet had an adequate conversation.

Perhaps the first step is to pause deliberately — and not to answer the next voice assistant, but to ask it a question first: who is actually speaking here?