At SlatorCon London 2025, Slator’s Silvia Terribile hosted a panel discussion centered on the challenges and importance of prosody in AI speech.
AppTek.ai experts Volker Steinbiss and Florian Lux were joined by neurolinguist Dr Julia Schwarz from the University of Cambridge to address advancements and challenges in recreating prosody in synthetic speech, as well as the complexities of customer demands and interdisciplinary collaboration.
The discussion began with Schwarz explaining the importance of prosody from a linguistic perspective: an element that structures speech rhythmically and melodically, encompassing intonation, amplitude, and duration.
From a neuroscientific perspective, explained Schwarz, prosody creates predictable patterns that human brains use for “neural entrainment,” i.e., aligning brainwaves with speech rhythms for efficient and correct processing. She added that, when prosodic cues are absent or significantly diminished, the brain struggles to effectively process and comprehend speech.
On the issue of the technical hurdles involved in replicating prosody accurately, AppTek’s Lux commented on how cascaded AI systems, which convert speech to text, translate it, and then synthesize new speech, inherently lose essential prosodic information during the text conversion step.
This is particularly problematic when translating between languages with vastly different prosodic structures, added Lux. And while a single, end-to-end AI model that directly translates speech could theoretically preserve prosody, Lux explained that “you need to have aligned speech input and speech output in the source language and target language, ideally with a matching speaker. And this kind of data simply doesn’t exist.”
“You need to have aligned speech input and speech output in the source language and target language, ideally with a matching speaker. And this kind of data simply doesn’t exist.” — Florian Lux, Speech Technology Scientist, AppTek.ai
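To illustrate the architectural point Lux raised, the sketch below shows, in simplified form, why a cascaded pipeline loses prosody: once the source audio has been reduced to plain text for translation, pitch, energy, and timing information has nowhere to go. The types and function names are hypothetical placeholders for illustration only, not AppTek.ai's actual system.

```python
from dataclasses import dataclass

# Hypothetical sketch of a cascaded speech-to-speech translation pipeline.
# The key point: the plain-text bottleneck between ASR and MT discards the
# prosodic tracks measured on the source audio, so TTS must invent prosody
# for the target language from scratch.

@dataclass
class SourceUtterance:
    audio: bytes                # raw waveform of the source speech
    f0_contour: list[float]     # pitch track (prosody)
    energy: list[float]         # loudness over time (prosody)
    durations: list[float]      # word/phone timing (prosody)

def asr(utt: SourceUtterance) -> str:
    """Speech-to-text: only the words survive this step."""
    return "transcribed source text"        # placeholder output

def mt(text: str) -> str:
    """Text-to-text machine translation: operates on words alone."""
    return "translated target text"         # placeholder output

def tts(text: str) -> bytes:
    """Text-to-speech: receives no information about how the original was spoken."""
    return b"synthesized target audio"      # placeholder output

def cascaded_s2st(utt: SourceUtterance) -> bytes:
    text = asr(utt)         # prosody is lost here: f0, energy, durations are unused
    translated = mt(text)
    return tts(translated)
```

An end-to-end model of the kind Lux describes would bypass this text bottleneck, but, as he notes, it would require aligned source and target speech from matching speakers, data that does not exist at scale.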
Meeting Customer Demands
Steinbiss turned to customer expectations for AI-generated speech when it comes to prosody and emotion, explaining that these depend greatly on the application, i.e., whether synthetic speech is used for a simple, short video or a major production.
For high-stakes content, viewers demand full immersion, meaning that even minor prosodic glitches can be jarring, highlighted Steinbiss. In contrast, for applications like news content, the quality bar is lower.
Steinbiss advised against automating “everything.” His rule of thumb is to always keep a human in the loop, including a highly granular post-editing phase. For AI speech, he suggested, content owners should start with something simple, such as a talking head or flat speech.
The Interdisciplinary Puzzle
Cracking the code of prosody is inherently an interdisciplinary challenge, and the experts agreed that there is no single “magic” prosodic cue.
On the linguistic side, Schwarz commented on prosody’s deep link to language meaning. “It’s not something that’s regularly taken into account in speech models. So that’s a big challenge, but I think it would improve the models massively if it could be done,” she added.


Lux concurred that insights from disciplines such as linguistics, neuroscience, and cognitive science are very valuable. “We want to make use of this knowledge, but we have to somehow turn this now into code that runs on a server.”
Schwarz added that the intricate mix of prosodic cues is not universal. It differs significantly across languages. Consequently, researchers and developers striving for truly natural-sounding AI speech cannot rely on a singular solution.
What Research Reveals
Preliminary findings from the ProsodAI research project at the University of Cambridge are shedding light on how humans process AI-generated speech. Schwarz explained that the project was set up to study exactly this question, given that synthetic speech is a new kind of input that is becoming increasingly prevalent.
The project, still a work in progress, is divided into two parts: a behavioral strand, which distils what sounds “natural” to listeners, and a neural strand, which uses brain data to examine how neural rhythms align with speech when the prosody is not quite right.
“Unsurprisingly, all the models do pretty well in intelligibility,” commented Schwarz. Naturalness is a different story: models that better imitate prosody were consistently rated as more natural.


Interestingly, listener background and expectations also play a role. For instance, British English speakers in the study showed a clear bias towards British-accented AI voices, finding them more natural. They also reacted less negatively to compressed melodic variability in male speakers, potentially due to societal exposure to less expressive male voices, commented Schwarz.
The panelists also emphasized that AI dubbing is rapidly expanding the volume of content that can be localized. As AI speech continues to advance, the focus is shifting from simply making speech intelligible to making it truly natural and emotionally resonant.
This “next problem to solve” aims to reduce post-editing efforts and make foreign language content more accessible and affordable. The overall goal is to develop systems that not only speak clearly but also convey the full spectrum of human emotion and intent, making AI voices indistinguishable from our own.