How do humans learn to speak?
Prosody is the diverse set of phenomena, such as cadence and intonation, that humans use to communicate. Babies learn prosody before they learn language. I could rely on academic references, but nothing will make this point hit home more than watching two babies chat in baby language in front of a fridge.
These babies know what they are talking about.
Babies first develop a foundational understanding of sound, and of prosody in particular, before they can comprehend language. Only after that do they layer language into their skillset. In fact, infants learn languages partly thanks to the prosodic cues they pick up on.
Audio, then language?
The traditional approach to speech synthesis is text-to-speech: models start from language inputs and attempt to generate prosodically appropriate, expressive speech.
To make the next leap in expressive synthetic speech, perhaps we need to take a lesson out of these babies' books. We should first develop models with a foundational knowledge of audio (and prosody), and then layer on a linguistically informed model of the world (i.e. an LLM).
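To make the idea concrete, here is a minimal, hypothetical sketch of this two-stage recipe in PyTorch. Stage one pretrains a model on raw audio alone (next-frame prediction), so whatever regularities it captures, prosody included, are learned without any text. Stage two freezes that audio prior and layers language conditioning on top. Every class name, shape, and training choice below is an illustrative assumption on my part, not a description of any real system.

```python
import torch
import torch.nn as nn

class AudioFoundationModel(nn.Module):
    """Stage 1: self-supervised audio model. No text is involved;
    it learns sound (and hopefully prosody) from audio alone."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.in_proj = nn.Linear(n_mels, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.out_proj = nn.Linear(dim, n_mels)

    def forward(self, mel):  # mel: (batch, frames, n_mels)
        h = self.backbone(self.in_proj(mel))
        return self.out_proj(h)  # predicted next frames

class TextConditionedSpeech(nn.Module):
    """Stage 2: layer language on top of the frozen audio foundation,
    e.g. by letting audio states attend to LLM text embeddings."""
    def __init__(self, audio_model, text_dim=512, dim=256):
        super().__init__()
        self.audio = audio_model
        for p in self.audio.parameters():  # keep the prosody prior intact
            p.requires_grad = False
        self.text_proj = nn.Linear(text_dim, dim)
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, mel, text_emb):  # text_emb: e.g. LLM hidden states
        h = self.audio.in_proj(mel)
        t = self.text_proj(text_emb)
        h, _ = self.fuse(h, t, t)  # audio attends to language
        return self.audio.out_proj(self.audio.backbone(h))

# Stage 1: train on audio alone (shifted-frame reconstruction).
model = AudioFoundationModel()
mel = torch.randn(2, 100, 80)  # dummy mel spectrograms
loss = nn.functional.mse_loss(model(mel[:, :-1]), mel[:, 1:])

# Stage 2: condition the pretrained audio model on text embeddings.
tts = TextConditionedSpeech(model)
text_emb = torch.randn(2, 20, 512)  # stand-in for LLM hidden states
out = tts(mel[:, :-1], text_emb)
```

The detail that matters here is the ordering, not the particular layers: the audio model never sees text during pretraining, mirroring how infants acquire prosody before words, and language is bolted on afterwards as conditioning.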
More broadly, generative audio models will likely play an important role in the next leaps in expressive speech synthesis.