This post is for anyone building AI models, rethinking human-computer interaction, or just wanting to build really great products. It explores the implications of advanced speech capabilities in machine learning, drawing on my five-plus years of experience and conversations building AI-native products.
AI speech has a significant capability overhang—its model capabilities far outpace their current user and economic impact. We now have the potential to generate expressive speech that can unlock transformative user experiences. From model development to product design, R&D teams must now confront the implications of computers that can truly communicate verbally.
Speech is one of our most fundamental forms of communication, yet it’s often treated as just another feature in modern products. While voice interfaces are widely recognized as key to the future, we continue to evaluate AI speech with outdated frameworks, akin to 1990s text-to-speech. This disconnect between speech’s potential and how we build and measure it is holding back the next wave of human-computer interaction.
OpenAI’s GPT-4o Advanced Voice Mode was the first widely available model to showcase multimodal speech capabilities, proving that The Bitter Lesson applies to speech: progress stems from scaling and generalization. But intelligence is no longer the moat—it’s a commodity. The real edge lies in how products leverage these capabilities, redefining speech as a dynamic communication medium.
Speech isn’t just about words—it’s about how they’re said. Prosody transforms meaning, intent, and emotion. This image captures its impact better than any explanation I’ve come across:
We’ll focus on the fundamental purpose of speech: enabling and enriching communication. From theory to practice, we’ll explore how modern speech synthesis solves real-world problems and creates value for businesses and users.
Speech as communication
To build great speech products, we need to understand the ‘job to be done’ of speech: communication. This starts with understanding the anatomy of speech events.
Every instance of spoken communication is a speech event—a single utterance or sentence conveying meaning. Each speech event contains two key components: lexical information (the words, or what is said) and prosodic information (the rhythm, pitch, and tone, or in other words how it’s said). Together, these elements create the message and effectively convey meaning.
Roman Jakobson’s model of speech events outlines six interdependent components—sender, receiver, context, message, contact (channel), and code—that together create meaning. A breakdown in any component can disrupt communication. For instance, if the code (shared language or conventions) doesn’t align with the context (situation or cultural backdrop), the receiver may misinterpret the prosody or intent within the message.

This interplay of factors is crucial for AI speech systems. Success doesn’t come from generating natural-sounding audio—it’s about harmonizing these components to ensure speech achieves its communicative purpose, whether guiding a driver or conveying emotion in a localized dramatic video.

Jakobson’s model offers a solid foundation but it doesn’t fully address how we assess the success of a speech event in practice. The diagram above hints at a positive or negative evaluation, but communication is rarely binary—the receiver may grasp parts of the message or interpret unintended meanings.
To better understand these challenges, Speech Act Theory adds nuance by highlighting three layers of communication:
• Locutionary Force: The literal meaning of the words
• Illocutionary Force: The intent behind the message
Speech products in practice
Let’s apply these observations to three product categories where speech is making a significant impact, each with unique challenges, opportunities, and economic implications.
AI dubbing (e.g. Papercup)
Traditionally, prosody transfer reproduced the rhythm, pitch, and tone of source speech directly, often failing to meet cultural expectations. The shift toward prosody translation adapts delivery to the target language and audience. For example, a passionate monologue in English may need a more restrained tone in German to convey authentically. This evolution ensures localized content feels natural, emotionally resonant, and culturally aligned.
Mapping navigation (e.g. Google Maps, Apple Maps)
Navigation systems rely on flat, monotonous delivery, leaving users to infer urgency or intent from words alone. This increases cognitive load and leads to frustration (speaking from experience!).
Digital avatar videos (e.g. HeyGen, Synthesia)
Digital avatars are revolutionizing video production by reducing costs and enabling rapid localization. These systems can generate videos in multiple languages from a single creation, but localization requires adjustments to the code to reflect linguistic and cultural conventions across regions.
In digital video, alignment between avatars and synthetic speech is crucial. Flat, robotic voices paired with hyper-realistic visuals—or expressive voices with stiff animations—diminish user experience. A safety training video calls for calm authority, while sales enablement content thrives on dynamic, engaging delivery. Cohesion across all elements is vital.
Evaluation as a key enabler

Beyond multimodality, advances in prosody are the hallmark of frontier speech models. However, defining and evaluating prosody in speech synthesis remains a significant challenge, particularly as the field shifts toward nuanced approaches like prosody translation. Recent research has started addressing this gap by exploring methods to evaluate prosodic intent—an area where real-world applications outpace academic progress. At Papercup, we’ve been thinking about this to meet the demands of AI dubbing.
Preference without purpose

Moving beyond classical evaluations
To design effective evaluation frameworks, we must move beyond intelligibility and naturalness, focusing instead on communicative success. The question isn’t just “Does it sound human?” but “Does it fulfill its purpose?”—whether that’s informing, reassuring, persuading, or entertaining.
Speech Act Theory provides a valuable lens for this shift. By examining the content (locutionary), intent (illocutionary), and impact (perlocutionary) of speech, we can create evaluations that truly reflect real-world communicative goals. Key questions include:
• Is the intended meaning conveyed clearly?
• Does the speech align with the context and intent?
A tutor bot might be evaluated on how effectively its tone and delivery enhance understanding and retention of key concepts, while voice-enabled interactive games could be evaluated on their ability to adapt tone and pacing to sustain user engagement. The combination of engagement-driven chatbots and expressive voices is bound to pose significant societal challenges.
Next up for speech evaluation
Modern speech synthesis systems seamlessly integrate text, audio, and other inputs, but evaluations often lag behind this multimodal complexity. Video, as a medium, provides a clearer lens for testing speech performance, particularly in capturing emotional nuance and aligning audio with visual context.
Scalability is another critical challenge. Data-labeling platforms excel in text and image evaluations but lack robust tools for assessing speech. Expanding to evaluate prosody, intent, and emotion requires addressing audio’s unique challenges, from its nuanced nature to cultural variability.
Conclusion
AI-generated speech systems have reached a pivotal moment. Modern speech systems are highly capable, but their real value lies in enhancing communication—not just technical achievements. Speech systems must deliver meaning, intent, and emotion in natural, intuitive, and impactful ways.
To achieve this, evaluation must move beyond classical paradigms like TTS, embracing frameworks that reflect speech’s role as a communication medium. By designing systems that align with real-world contexts, we can build products that don’t just sound human but truly connect with users.
“I didn't have time to write a short letter, so I wrote a long one instead.”