Speech AI has 10x’d: purpose, evaluation, and products

This post is for anyone building AI models, rethinking human-computer interaction, or just wanting to build really great products. It explores the implications of advanced speech capabilities in machine learning, drawing on my five-plus years of experience and conversations building AI-native products.

AI speech has a significant capability overhang—its model capabilities far outpace their current user and economic impact. We now have the potential to generate expressive speech that can unlock transformative user experiences. From model development to product design, R&D teams must now confront the implications of computers that can truly communicate verbally.

Speech is one of our most fundamental forms of communication, yet it’s often treated as just another feature in modern products. While voice interfaces are widely recognized as key to the future, we continue to evaluate AI speech with outdated frameworks, akin to 1990s text-to-speech. This disconnect between speech’s potential and how we build and measure it is holding back the next wave of human-computer interaction.

OpenAI’s GPT-4o Advanced Voice Mode was the first widely available model to showcase multimodal speech capabilities, proving that The Bitter Lesson applies to speech: progress stems from scaling and generalization. But intelligence is no longer the moat—it’s a commodity. The real edge lies in how products leverage these capabilities, redefining speech as a dynamic communication medium.

Speech isn’t just about words—it’s about how they’re said. Prosody transforms meaning, intent, and emotion. This image captures its impact better than any explanation I’ve come across:

We’ll focus on the fundamental purpose of speech: enabling and enriching communication. From theory to practice, we’ll explore how modern speech synthesis solves real-world problems and creates value for businesses and users.

Speech as communication

To build great speech products, we need to understand the ‘job to be done’ of speech: communication. This starts with understanding the anatomy of speech events.

Every instance of spoken communication is a speech event—a single utterance or sentence conveying meaning. Each speech event contains two key components: lexical information (the words, or what is said) and prosodic information (the rhythm, pitch, and tone, or in other words how it’s said). Together, these elements create the message and effectively convey meaning.

A single message without context is generally insufficient to communicate all the information. Who is the intended receiver? What is the speaker responding to? Through what medium is the speech delivered? Linguistics and information theory provide frameworks to help us navigate these complexities when deploying intelligent speech systems.

Roman Jakobson’s model of speech events outlines six interdependent components—sender, receiver, context, message, contact (channel), and code—that together create meaning. A breakdown in any component can disrupt communication. For instance, if the code (shared language or conventions) doesn’t align with the context (situation or cultural backdrop), the receiver may misinterpret the prosody or intent within the message.

This interplay of factors is crucial for AI speech systems. Success doesn’t come from generating natural-sounding audio—it’s about harmonizing these components to ensure speech achieves its communicative purpose, whether guiding a driver or conveying emotion in a localized dramatic video.

A great starting point for practical exploration is mapping the components of a speech event to the product interaction your user will experience. Use the diagram below to guide this process. Keep in mind, each speech event typically conveys a single message, but user interactions often build into either a conversational exchange (back-and-forth utterances) or a sequence of utterances (such as delivering a set of instructions).

Jakobson’s model offers a solid foundation but it doesn’t fully address how we assess the success of a speech event in practice. The diagram above hints at a positive or negative evaluation, but communication is rarely binary—the receiver may grasp parts of the message or interpret unintended meanings.

For example, a speaker might aim to communicate sincerity or reassurance but unintentionally convey doubt or indifference through their prosody. This challenge is equally relevant for synthetic speech. AI-generated speech must carefully balance prosodic and linguistic elements to avoid undermining the intended communicative goals, ensuring that it conveys both the desired message and tone effectively.

To better understand these challenges, Speech Act Theory adds nuance by highlighting three layers of communication:

• Locutionary Force: The literal meaning of the words

• Illocutionary Force: The intent behind the message

• Perlocutionary Force: The effect on the listener

Synthesized speech must succeed on all three levels. It’s not enough to “sound human”—systems must express intent and achieve the desired impact, whether resolving a frustrated customer’s issue with empathy and precision or making a user feel understood during a virtual conversation.

Speech products in practice

Let’s apply these observations to three product categories where speech is making a significant impact, each with unique challenges, opportunities, and economic implications.

AI dubbing (e.g. Papercup)

AI dubbing is one of the most demanding speech synthesis applications because it must replicate, not originate, expressivity. This requires transferring lexical and prosodic information across languages while preserving the illocutionary force (intent) of the original speech.

Traditionally, prosody transfer reproduced the rhythm, pitch, and tone of source speech directly, often failing to meet cultural expectations. The shift toward prosody translation adapts delivery to the target language and audience. For example, a passionate monologue in English may need a more restrained tone in German to convey authentically. This evolution ensures localized content feels natural, emotionally resonant, and culturally aligned.

Mapping navigation (e.g. Google Maps, Apple Maps)

Navigation systems rely on flat, monotonous delivery, leaving users to infer urgency or intent from words alone. This increases cognitive load and leads to frustration (speaking from experience!).

Dynamic expressivity offers a solution by adapting tone to context. For instance, a firmer tone for “Turn left in 200 meters” can signal urgency, while a calmer tone for “Turn left in 5 kilometers” allows for deliberate planning. By aligning speech with context, navigation systems can improve clarity, reduce cognitive effort, and enhance trust.

Digital avatar videos (e.g. HeyGen, Synthesia)

Digital avatars are revolutionizing video production by reducing costs and enabling rapid localization. These systems can generate videos in multiple languages from a single creation, but localization requires adjustments to the code to reflect linguistic and cultural conventions across regions.

In digital video, alignment between avatars and synthetic speech is crucial. Flat, robotic voices paired with hyper-realistic visuals—or expressive voices with stiff animations—diminish user experience. A safety training video calls for calm authority, while sales enablement content thrives on dynamic, engaging delivery. Cohesion across all elements is vital.

These examples hint at how we should think about enabling speech synthesis as a transformative layer in AI-native products. Realizing its full potential requires evaluation methods that move beyond naturalness to measure how effectively systems communicate and deliver value.

Evaluation as a key enabler

We need to move beyond the simpler world of TTS (text-to-speech), which no longer reflects the reality of intelligent, multimodal systems. These models now integrate diverse inputs and outputs, including text, images, and audio. Advancements in text-to-image generation demonstrate the growing versatility of multimodal research. Modern speech synthesis leverages this progress to produce expressive, contextually rich speech. Clinging to the TTS label underestimates these systems’ potential and narrows our vision for their future applications.

Beyond multimodality, advances in prosody are the hallmark of frontier speech models. However, defining and evaluating prosody in speech synthesis remains a significant challenge, particularly as the field shifts toward nuanced approaches like prosody translation. Recent research has started addressing this gap by exploring methods to evaluate prosodic intent—an area where real-world applications outpace academic progress. At Papercup, we’ve been thinking about this to meet the demands of AI dubbing.

Preference without purpose

TTS Arena, is a Hugging Face space for ranking speech models, which exemplifies how traditional approaches limit progress. Users compare two sentences and select their preference, but this evaluation is divorced from the speech’s purpose. It provides little insight into what was communicated or how well the speech fulfilled its intended function. Such methods fail to drive meaningful advancements in end-to-end speech synthesis without accounting for communicative context. To unlock the potential of modern systems, we need evaluations that measure how effectively they convey meaning, intent, and emotion in real-world scenarios.

Moving beyond classical evaluations

To design effective evaluation frameworks, we must move beyond intelligibility and naturalness, focusing instead on communicative success. The question isn’t just “Does it sound human?” but “Does it fulfill its purpose?”—whether that’s informing, reassuring, persuading, or entertaining.

Speech Act Theory provides a valuable lens for this shift. By examining the content (locutionary), intent (illocutionary), and impact (perlocutionary) of speech, we can create evaluations that truly reflect real-world communicative goals. Key questions include:

• Is the intended meaning conveyed clearly?

• Does the speech align with the context and intent?

• Does it achieve the desired effect on the listener?

A tutor bot might be evaluated on how effectively its tone and delivery enhance understanding and retention of key concepts, while voice-enabled interactive games could be evaluated on their ability to adapt tone and pacing to sustain user engagement. The combination of engagement-driven chatbots and expressive voices is bound to pose significant societal challenges.

As we advance evaluation frameworks, it’s essential to retain the human element. Speech, one of our oldest and most intuitive forms of communication, predates writing and is deeply rooted in human culture. Automation will be vital for scaling evaluations, but human oversight is critical to ensure these systems align with our values and communicate in meaningful, impactful ways. By blending human judgment with automated tools, we can create systems that prioritize real-world utility over mere model optimization.

Next up for speech evaluation

Modern speech synthesis systems seamlessly integrate text, audio, and other inputs, but evaluations often lag behind this multimodal complexity. Video, as a medium, provides a clearer lens for testing speech performance, particularly in capturing emotional nuance and aligning audio with visual context.

Scalability is another critical challenge. Data-labeling platforms excel in text and image evaluations but lack robust tools for assessing speech. Expanding to evaluate prosody, intent, and emotion requires addressing audio’s unique challenges, from its nuanced nature to cultural variability.

Closing these gaps will enable evaluation frameworks that go beyond naturalness to assess how effectively speech communicates meaning and intent in real-world scenarios.

Conclusion

AI-generated speech systems have reached a pivotal moment. Modern speech systems are highly capable, but their real value lies in enhancing communication—not just technical achievements. Speech systems must deliver meaning, intent, and emotion in natural, intuitive, and impactful ways.

To achieve this, evaluation must move beyond classical paradigms like TTS, embracing frameworks that reflect speech’s role as a communication medium. By designing systems that align with real-world contexts, we can build products that don’t just sound human but truly connect with users.

The evolution of speech in AI must do justice to its position as our most fundamental communication tool. The key to advancing speech technologies lies in critically reflecting on their purpose—to enhance communication.

How do humans learn to speak?

Prosody is the diverse phenomena of cadence and intonations that humans use to communicate. Babies learn prosody before they learn language. I could rely on academic references, but nothing will make this point hit home more than watching two babies talk in baby language in front of a fridge.

These babies know what they are talking about.

Babies first develop a foundational understanding of sound, and in particular prosody, before they can comprehend language. After learning this they layer language into their skillset. In fact, infants learn languages in part thanks to prosodic cues that they pick up on.

Audio, then language?

I've written about how solving the prosody problem is the most valuable problem in speech. Enabling computers to communicate with meaningful prosody is a foundational challenge in human-computer interaction, and would unlock incredible user benefits.

The traditional method in speech synthesis is text-to-speech. Models attempt to generate expressive speech by starting with language inputs and outputting prosodically appropriate speech.

To make the next leap in expressive synthetic speech perhaps we need to take a lesson out of these baby's books. We should first develop models with a foundational knowledge of audio (and prosody) and then layer a linguistic informed model of the world after (i.e. an LLM).

Generative audio models broadly will likely play an important role in the next leaps in expressive speech synthesis.

Teaching computers to speak: the prosody problem

Image: https://site-assets.plasmic.app/2f93a1a831c6cb808c7e4ab370e2c1ee.jpg

Article by Papercup's Head of Product, Kilian Butler

For hundreds of years, people have been striving to replicate the capabilities of human speech, employing technologies as varied as resonance tubes and machine learning. From HAL 9000 to C-3PO, KITT from ‘Knight Rider’ and Samantha from ‘Her’, fiction is rich with examples of computers that have mastered human speech.

AI and spatial computing are transitioning from science fiction into reality, marking the onset of a new generation of technology. Speech will be a core medium of interaction with AIs as they improve. To many it may feel like speech is mature compared to technologies like LLMs, but this underestimates how much information we convey in a sentence. Speech is a nuanced process of meaning-making that requires true understanding of human communication. In this blog we dive into this challenge: prosody generation!

The last few decades have seen significant improvements in electronic speech synthesis; progress that is audible when comparing the robotic tones of Stephen Hawking’s 1980s speech system, Equalizer, to the realistic speech produced by AI dubbing available on YouTube and streaming platforms today. However despite significant advancement in speech synthesis in recent years, computers still lag significantly behind humans in speech capabilities.

Machine learning techniques have advanced significantly but we must recognise that the creation of human speech is an extraordinarily complex process involving an intricate partnership between mind and body.

Speech enables humans to communicate using sound. It conveys information in two central ways: the words spoken (linguistic information) and how the words are said (prosodic information). How something is said can be as important as what is said. Prosody (how things are said) includes the patterns of stress, rhythm and intonation that determine how an utterance is delivered and communicates additional information to the words that allows us to infer meaning and intentions. Commonly referred to as expressivity, the prosody of a sentence can convey the speakers’ emotions, certainty, or any number of aspects related to their physical or mental state. Are they sincere or insincere? Was their speech planned or off-the-cuff? Prosody planning, the process by which we determine how to speak, is still relatively understudied and crucially cannot be separated from language production itself.

C-3PO, the metallic humanoid robot from the Star Wars series, represents the platonic ideal of a synthetic speaker.

He communicates his feelings clearly by producing the tones and intonations that leave no doubt of his worries and complaints, all delivered in Received Pronunciation British accent. He’s a perfect example of an advanced synthesis system that generates prosodically appropriate speech. To advance towards this vision, what must the next generation of speech synthesis models be capable of?

The prosody problem

The primary role of prosodic features in speech is to enhance communication between the speaker and the listener. Different subtexts in conversation can be communicated through prosodic information like:

Disagreeing with someone
Trying to subtly change the topic
Stopping someone interrupting you
Thinking while speaking slower
Showing you don’t care
Looking for an emotional response, e.g. empathy

The pitch, duration, inflections, intensity, loudness, and a whole host of other elements all represent meaning, or contribute to a function in spoken communications. This function refers to the illocutionary force of an utterance, which is defined as ‘the speaker’s intention in producing that utterance’. For example, if you didn’t want to to go an event, but your friend wants you to join them you might say “Yeah, maybe we should” but with a pause on “yeah” to display you’re uncertain. The illocutionary force here is you communicating that you don’t want to go (despite the words saying otherwise). The very common speech synthesis use case of audible directions (i.e. Garmin, Google Maps) are notable by their lack of illocutionary force. Research teams for these products will be looking to improve on the passive intonation, where all directions are treated equally.

Speech is an incredibly complex problem with multiple axes of variation. Minor changes in prosody can indicate significant changes in meaning, much in the same way that changes to the tone, style or grammar can convey subtext in language (try asking ChatGPT to make your next email passive aggressive).

Speech includes an array of other elements that are key in communications, but challenging to model: sarcasm, attitude (towards yourself or someone else), interruptions, laughter, filler phrases (ums and ahs etc.) and other non-verbal utterances. Speech is further enriched in its role as a communicative function by the presence of disfluencies and non-verbal speech.

A visual representation of speech & prosody. The grey waveform indicating the sound of the words ‘Speech’ & ‘Prosody’, with the black brushstrokes displaying the prosodic elements of pitch and periodicity. (Source: www.nigelward.com/prosody)

Current speech synthesis systems struggle to convey the depth of information that a human is capable of. The fantastic performance of voice cloning models can oftentimes obscure the relative paucity of prosodic features. Models with prosody generation and cross-lingual prosody transfer capabilities (speech-to-speech translation) are however excitingly starting to show promise, but there are unique aspects about the medium that will need to be addressed.

Synthetic speech products must generate both words and prosody from a given context. Large Language Models (LLMs) have displayed impressive improvements in computers’ ability to generate contextual language in the form of text. However, this addresses only half of what it takes to build communicative, intelligent systems. With LLMs, the task is to generate words; in speech models the task is to generate prosody (since synthetic speech can now generate high acoustic quality and accurate pronunciation). Future modelling improvements will necessitate appropriate prosody generation in a less sequential manner. Current state of the art multi-modal LLMs generate linguistic information and feed it to text-to-speech systems to infer prosody. In humans, however, the prosody and linguistic planning processes are more closely intertwined.

Even experts in the field of prosody still do not have a complete understanding of the structure and rules that we follow when communicating. English prosody is comparatively well understood, but there is still substantial debate among academics about how it really works. Understanding of non-English prosody is very limited and sparse, much less the cross-lingual mappings of prosody across languages, dialects and cultures.

The prosody problem exists among a set of problems that were previously out of reach for traditional software. These are challenges that relate to things like natural language, images, and physical space. One could refer to this type of thing as an ‘AI hard’ problem – a problem that can now conceivably be unlocked with modern machine learning techniques and hardware. Our limited understanding of prosody means that we cannot write an exhaustive set of rules which would govern how prosody changes meaning in context. In the same way that LLMs work better than rules based algorithms for contextual language, speech models must learn from data the ability to generate appropriate results from a given context. But what are the other unique aspects of speech that will pose challenges for product teams and researchers?

Other challenges in speech

Speech is a continuous signal processing challenge. This contrasts with image or language generation, which is considerably easier to represent. In this way, speech generation models are more analogous to generative video modelling, which is considerably less mature than image or language generation. This means speech is harder to tokenise and all tokens must be converted into analog signals. This last mile is not present in text or image generation.

Spoken languages are also extremely fragmented. Roughly two thirds of the world speak the top five languages. The remaining third of the world speaks a long tail of thousands of languages. The relative performance of LLMs across different languages (especially low-resource languages) is indicative of the challenges that speech generation will face.

The commercial attempts at productizing personal assistants can give a sense of the scale of the challenge. Apple’s Siri was launched in 2011 and Amazon’s Alexa has been funded to the tune of tens of billions of dollars. Both systems are still largely limited in their ability to generate prosodically appropriate speech despite extensive research and development. Both have made efforts to add some realistic prosodic feature generation, but when these features are applied to the wrong context it can be jarring, invoking the uncanny valley.

Despite the long history of vocoders and text-to-speech, speech is a much less mature field from an machine learning perspective with a comparatively smaller field of talent working on the challenges. Text-to-speech systems have been able to produce speech, which communicates lexical information (the words themselves) and was intelligible for several decades. In essence, text-to-speech was ‘good enough’ for a limited set of use cases for a long time. This influenced the shape of the research industry itself. Less machine learning talent flowed to speech (in favour of areas like computer vision or natural language processing), resulting in optimizations being applied at a component level (like vocoders) rather than in an end-to-end singular system.

Exciting progress has been made in recent years and there is significant potential upside in solving for prosody generation. As a result speech is already attracting more and more talent within machine learning, compounding the benefit to end users.

So what’s next?

Many modelling, data, architectural and operational techniques have yet to be applied in full to speech synthesis. However, we are beginning to see green shoots here and there is undoubtably great strides to be made in porting the learnings from other fields.

Machine learning teams will learn from and collaborate with anthropologists, linguists and other experts in the field of speech and language to deploy prosodic models globally. Communication is not a one size fits all system, with high and low context cultures deploying differing methods. Germanic cultures, for instance, communicate directly with language, whereas Asian or Latin cultures are more nuanced in their communication. Interestingly, it appears that human speech encodes information at roughly the same rate, regardless of language.

The broad application of speech encoding and decoding holds immense commercial promise. By enabling computers to generate and interpret prosodic nuances, we can surmount language barriers and facilitate smoother global communication. Advanced machine learning models, particularly those with multi-modal capabilities, are poised to become integral to human-computer interaction. In this dynamic landscape, AI laboratories, startups, and the open-source community are expected to integrate speech input and synthesis more deeply into their multi-modal systems. This integration will not only enhance the adoption of Large Language Models (LLMs) across businesses and consumer sectors but also enrich user engagement. Speech synthesis that is both expressive and captivating will be key in attracting and retaining users. Moreover, systems capable of perceptual decoding stand to respond more intuitively to user prompts and intentions, significantly elevating the user experience.

AI dubbing is a market ripe to power the acceleration of speech synthesis research. It is a prime example of deflationary AI software, which can provide accessibility for information and entertainment globally at cost and scale previously impossible. Human-in-the-loop (the process in which humans check and adjust the generated audio) can control for failure modes in prosody prediction and provide crucial data to improve model performance over time.

Conclusion

It's evident that while significant progress has been made in speech synthesis, the prosody problem will be the defining challenge of the next generation of models. The opportunity is vast – the ability to revolutionize how we interact with technology, enhancing global communication and accessibility. We must also be mindful of the ethical dimensions of this technology, ensuring it enriches rather than exploits our human interactions.

The journey towards creating lifelike synthetic speech is not just about technological achievement; it's about deepening our understanding of human communication and its potential in bridging divides

We might still be a long way from creating intelligent C-3PO level communicators, but the momentum in the field is building. Expect exciting things from Papercup and the industry as a whole.

--------------------------

Huge thanks to my colleagues at Papercup for their assistance in pulling this together. I’m very lucky to work alongside some of the world’s foremost experts in prosody and synthetic speech. Special thanks to Hannah, Zack, Simon, Devang, Doniyor, James, Prass, and Jesse for answering my many questions, editing and reading drafts. Additional thanks to Nigel Ward for his time and thoughts on prosody. And thank you to anyone who took the time to read this and found it interesting.

First published at https://www.papercup.com/blog/realistic-synthetic-voices

prosody

thoughts on the most important problem in speech

Speech AI has 10x’d: purpose, evaluation, and products

Speech as communication

Speech products in practice

AI dubbing (e.g. Papercup)

Mapping navigation (e.g. Google Maps, Apple Maps)

Digital avatar videos (e.g. HeyGen, Synthesia)

Evaluation as a key enabler

Preference without purpose

Moving beyond classical evaluations

Next up for speech evaluation

Conclusion

Babys talk in baby language in front of fridge

How do humans learn to speak?

Audio, then language?

Teaching computers to speak: the prosody problem

The prosody problem

A visual representation of speech & prosody. The grey waveform indicating the sound of the words ‘Speech’ & ‘Prosody’, with the black brushstrokes displaying the prosodic elements of pitch and periodicity. (Source: www.nigelward.com/prosody)

Other challenges in speech

So what’s next?

Conclusion

Speech as communication

Speech products in practice

AI dubbing (e.g. Papercup)

Mapping navigation (e.g. Google Maps, Apple Maps)

Digital avatar videos (e.g. HeyGen, Synthesia)

Evaluation as a key enabler

Preference without purpose

Moving beyond classical evaluations

Next up for speech evaluation

Conclusion

How do humans learn to speak?

Audio, then language?

The prosody problem

A visual representation of speech & prosody. The grey waveform indicating the sound of the words ‘Speech’ & ‘Prosody’, with the black brushstrokes displaying the prosodic elements of pitch and periodicity. (Source: www.nigelward.com/prosody)

Other challenges in speech

So what’s next?

Conclusion

Subscribe by email

Subscribe by email