Expressivity and Speech Synthesis

Andreas Triantafyllopoulos, Björn W. Schuller

TL;DR

The chapter surveys expressivity in speech synthesis, tracing its evolution from basic voice modulation to Stage II systems capable of sustained, context-aware expression in dialogue. It argues that Stage I expressive primitives form the building blocks for longer-term behaviours and discusses Stage II frameworks based on learnt policies, mixed states, and personalised strategies, enabled by foundation models and multimodal AI. It also examines the societal implications of increasingly expressive AI voices, ranging from improved human–machine interaction to risks of manipulation, misinformation, and deepfakes, and proposes alignment and auditing strategies. The work aims to guide future ESS research by clarifying states/traits, control methods, and ethical considerations while highlighting practical pathways toward end-to-end, personalised, and responsibly deployed ESS systems.

Abstract

Imbuing machines with the ability to talk has been a longtime pursuit of artificial intelligence (AI) research. From the very beginning, the community has not only aimed to synthesise high-fidelity speech that accurately conveys the semantic meaning of an utterance, but also to colour it with inflections that cover the same range of affective expressions that humans are capable of. After many years of research, it appears that we are on the cusp of achieving this when it comes to single, isolated utterances. This opens up an abundance of avenues for combining such single utterances to synthesise more complex, longer-term behaviours. In the present chapter, we outline the methodological advances that brought us this far and sketch out the ongoing efforts to reach that coveted next level of artificial expressivity. We also discuss the societal implications associated with rapidly advancing expressive speech synthesis (ESS) technology and highlight ways to mitigate those risks and ensure the alignment of ESS capabilities with ethical norms.

Paper Structure (27 sections, 5 equations, 5 figures, 1 algorithm)

Figures (5)

  • Figure 1: A non-exhaustive taxonomy of states and traits that humans express through their speech, largely informed by previous work on recognising them (e.g., see http://www.compare.openaudio.eu/tasks/ as well as Schuller14-CPE). While these states might not all be relevant for ESS systems, they illustrate the plethora of styles that can be synthesised. Further, they help us distinguish between two crucial components: how long each style lasts, and how persistent its appearance is.
  • Figure 2: Overview of different application domains that can benefit from expressive speech synthesis: human-computer interaction entails real-time communication between a human and an expressive chatbot; content creation encompasses all possible forms of de novo artificial content creation (e.g., video narration); voice enhancement targets the manipulation of a real human's voice; finally, computer-computer interaction sketches a scenario where expressive chatbots communicate with one another in the wild.
  • Figure 3: Overview of a typical ESS pipeline. An input text is first synthesised in a neutral style (gray) and then transformed to expressive speech (red), although these steps can also be integrated in an end-to-end model. The style is controlled either by a) a reference encoder which accepts as input a speech sample having the required style; b) a textual description in free text; c) a 'tag' which allows the user to select from a fixed set of predefined styles.
  • Figure 4: Overview of a Stage II ESS workflow, where an intelligent assistant pursues its overall goal of befriending a user, a goal which in turn guides each conversation. The middle panel shows the inner workings of the agent, while the side panels show the outcome of the conversation. The agent monitors the user's affective state and adjusts its responses accordingly, picking from an array of available expressive styles. We note that the styles available to the agent are not necessarily as interpretable as the ones we outline here; rather, we actually expect self-learning agents to develop their own internalised concepts which perhaps remain opaque to humans (see text for a more detailed discussion).
  • Figure 5: Blueprint for training a Stage II ESS pipeline. Training is first bootstrapped using available human-to-human conversations, with the agent rewarded for matching the next response by one of the two interlocutors. This initial training can be followed by a second reinforcement learning stage where the agent is fine-tuned on human-to-machine conversations, generated either on-policy (i.e., by the agent being currently trained, perhaps in online fashion) or off-policy (i.e., relying on prerecorded conversations).
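
The three style-control options named in the Figure 3 caption (reference sample, free-text description, fixed tag) can be illustrated with a small Python sketch. This is not the authors' implementation: all class and function names, the tag set, and the placeholder `synthesise` function are assumptions made purely for illustration, and no actual audio is generated.

```python
from dataclasses import dataclass
from typing import Sequence, Union

# Hypothetical tag set; the caption only mentions "a fixed set of predefined styles".
PREDEFINED_STYLES = ("neutral", "happy", "sad", "angry")


@dataclass(frozen=True)
class ReferenceControl:
    """(a) A speech sample that exhibits the required style."""
    waveform: Sequence[float]


@dataclass(frozen=True)
class DescriptionControl:
    """(b) A free-text description of the required style."""
    text: str


@dataclass(frozen=True)
class TagControl:
    """(c) A tag chosen from the fixed set of predefined styles."""
    tag: str

    def __post_init__(self) -> None:
        if self.tag not in PREDEFINED_STYLES:
            raise ValueError(f"unknown style tag: {self.tag!r}")


StyleControl = Union[ReferenceControl, DescriptionControl, TagControl]


def describe_control(control: StyleControl) -> str:
    """Return a readable summary of which control path is being used."""
    if isinstance(control, ReferenceControl):
        return f"reference encoder over {len(control.waveform)} samples"
    if isinstance(control, DescriptionControl):
        return f"text description: {control.text!r}"
    if isinstance(control, TagControl):
        return f"style tag: {control.tag}"
    raise TypeError(f"unsupported control type: {type(control).__name__}")


def synthesise(text: str, control: StyleControl) -> dict:
    """Placeholder for the two-step pipeline: neutral synthesis, then styling."""
    neutral = {"text": text, "style": "neutral"}          # step 1: neutral rendering
    return {**neutral, "style": describe_control(control)}  # step 2: apply style


if __name__ == "__main__":
    print(synthesise("Hello there", TagControl("happy")))
    print(synthesise("Hello there", DescriptionControl("warm and reassuring")))
```

Modelling the three controls as separate types keeps the dispatch explicit; an end-to-end model, as the caption notes, would merge the two steps inside `synthesise`.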