
How do humans learn to speak?

Prosody is the diverse set of phenomena, cadence and intonation among them, that humans use to communicate. Babies learn prosody before they learn language. I could rely on academic references, but nothing will drive this point home more than watching two babies talk in baby language in front of a fridge.

These babies know what they are talking about.

Babies first develop a foundational understanding of sound, and in particular prosody, before they can comprehend language. Only after that do they layer language on top of this foundation. In fact, infants learn languages in part thanks to the prosodic cues they pick up on.


Audio, then language?

I've written about how solving the prosody problem is the most valuable problem in speech. Enabling computers to communicate with meaningful prosody is a foundational challenge in human-computer interaction, and solving it would unlock enormous benefits for users.

The traditional method in speech synthesis is text-to-speech: models start from language inputs and attempt to output prosodically appropriate speech. Note the ordering, language first, prosody second, which is the reverse of how babies learn.
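As a rough sketch of that pipeline, here is what the traditional text-to-speech stack looks like, with random stubs standing in for the learned components. Every function name and shape below is hypothetical, chosen only to show the data flow from text to waveform:

```python
import numpy as np

# Schematic of the traditional TTS pipeline: text in, audio out.
# The two "models" below are random stubs standing in for real
# learned components (an acoustic model and a vocoder).

def text_to_tokens(text: str) -> np.ndarray:
    """Front end: map characters to integer IDs (real systems use
    phonemes, stress marks, punctuation, and more)."""
    return np.array([ord(c) for c in text.lower()])

def acoustic_model(tokens: np.ndarray, n_mels: int = 80) -> np.ndarray:
    """Stub acoustic model: tokens -> mel-spectrogram frames.
    This is where prosody (pitch, duration, energy) gets decided,
    purely from text -- the crux of the problem."""
    n_frames = len(tokens) * 5  # crude stand-in for a duration model
    return np.random.randn(n_frames, n_mels)

def vocoder(mel: np.ndarray, hop: int = 256) -> np.ndarray:
    """Stub vocoder: mel frames -> waveform samples."""
    return np.random.randn(mel.shape[0] * hop)

tokens = text_to_tokens("Babies know what they are talking about.")
mel = acoustic_model(tokens)
audio = vocoder(mel)
print(f"{len(tokens)} tokens -> {mel.shape[0]} mel frames -> {len(audio)} samples")
```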



To make the next leap in expressive synthetic speech, perhaps we need to take a lesson out of these babies' book. We should first develop models with a foundational knowledge of audio (and prosody), and then layer a linguistically informed model of the world on top (i.e. an LLM).
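If this audio-first ordering is right, a training recipe might look roughly like the sketch below. This is a hypothetical PyTorch illustration, not any published system: the module names, sizes, objective, and the choice to freeze the audio encoder are all assumptions, meant only to show "audio foundation first, language layered on second."

```python
import torch
import torch.nn as nn

# Stage 1: an audio "foundation" model pretrained on raw sound alone,
# e.g. with a masked-prediction objective. It never sees text, so
# whatever it learns about prosody comes from audio itself.
class AudioEncoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # ~20 ms frames at 16 kHz (illustrative numbers)
        self.frontend = nn.Conv1d(1, dim, kernel_size=400, stride=320)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> frames: (batch, time, dim)
        frames = self.frontend(wav.unsqueeze(1)).transpose(1, 2)
        return self.encoder(frames)

# Stage 2: layer a language model on top. A tiny text decoder attends
# to the (frozen) audio representations, standing in for "an LLM
# layered onto an audio foundation".
class TextOnAudio(nn.Module):
    def __init__(self, vocab: int = 1000, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor, audio_repr: torch.Tensor) -> torch.Tensor:
        return self.out(self.decoder(self.embed(tokens), audio_repr))

audio_model = AudioEncoder()
# ... pretrain audio_model on raw audio here, no text involved ...
for p in audio_model.parameters():
    p.requires_grad = False  # freeze the audio foundation

lm = TextOnAudio()
wav = torch.randn(2, 16000)              # one second of fake audio, batch of 2
tokens = torch.randint(0, 1000, (2, 12))
logits = lm(tokens, audio_model(wav))
print(logits.shape)  # (2, 12, 1000)
```

The point of the ordering is that prosodic structure is already present in the frozen audio representations before any text enters the picture, mirroring how babies layer language onto sound.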

Generative audio models more broadly will likely play an important role in the next leaps in expressive speech synthesis.