Expressivity and Speech Synthesis
Andreas Triantafyllopoulos, Björn W. Schuller
TL;DR
The chapter surveys expressivity in speech synthesis, tracing its evolution from basic voice modulation to Stage II capabilities for sustained, context-aware expression in dialogue. It argues that Stage I expressive primitives form the building blocks for longer-term behaviours and discusses Stage II frameworks based on learnt policies, mixed states, and personalised strategies, enabled by foundation models and multimodal AI. It also examines the societal implications of increasingly expressive AI voices, ranging from improved human–machine interaction to risks of manipulation, misinformation, and deepfakes, and proposes alignment and auditing strategies. The work aims to guide future expressive speech synthesis (ESS) research by clarifying states/traits, control methods, and ethical considerations while highlighting practical pathways toward end-to-end, personalised, and responsibly deployed ESS systems.
Abstract
Imbuing machines with the ability to talk has been a long-standing pursuit of artificial intelligence (AI) research. From the very beginning, the community has aimed not only to synthesise high-fidelity speech that accurately conveys the semantic meaning of an utterance, but also to colour it with inflections that cover the same range of affective expressions that humans are capable of. After many years of research, it appears that we are on the cusp of achieving this for single, isolated utterances. This opens up many avenues for combining these single utterances to synthesise more complex, longer-term behaviours. In the present chapter, we outline the methodological advances that have brought us this far and sketch out the ongoing efforts to reach that coveted next level of artificial expressivity. We also discuss the societal implications of rapidly advancing expressive speech synthesis (ESS) technology and highlight ways to mitigate the associated risks and ensure the alignment of ESS capabilities with ethical norms.
