Learning Frame-Wise Emotion Intensity for Audio-Driven Talking-Head Generation

Jingyi Xu, Hieu Le, Zhixin Shu, Yang Wang, Yi-Hsuan Tsai, Dimitris Samaras

TL;DR

A talking-head framework that generates a variety of emotions with precise control over intensity levels. It learns a continuous emotion latent space in which emotion type is encoded in the latent orientation and emotion intensity in the latent norm.

Abstract

Human emotional expression is inherently dynamic, complex, and fluid, characterized by smooth transitions in intensity throughout verbal communication. However, previous audio-driven talking-head generation methods have largely overlooked the modeling of such intensity fluctuations, which often results in static emotional outputs. In this paper, we explore how emotion intensity fluctuates during speech and propose a method for capturing and generating these subtle shifts in talking-head generation. Specifically, we develop a talking-head framework that generates a variety of emotions with precise control over intensity levels. This is achieved by learning a continuous emotion latent space, where emotion types are encoded in latent orientations and emotion intensity is reflected in latent norms. In addition, to capture the dynamic intensity fluctuations, we adopt an audio-to-intensity predictor that infers intensity from the speaking tone. The training signals for this predictor are obtained through our emotion-agnostic intensity pseudo-labeling method, without requiring frame-wise intensity labels. Extensive experiments and analyses validate the effectiveness of our method in accurately capturing and reproducing emotion intensity fluctuations in talking-head generation, thereby significantly enhancing the expressiveness and realism of the generated outputs.

Paper Structure (27 sections, 2 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: An overview of our proposed method. To generate talking-heads with fluid intensity transitions, we first infer these fluctuations from audio inputs using an audio-to-intensity predictor $\mathbf{P}$. The training targets for this predictor are obtained through our emotion-agnostic intensity pseudo-labeling method. Then, for a specified driving emotion $e$ (e.g., happy), we map it onto our proposed emotion latent space to get $M(f_e)$ and adjust its norm based on the inferred intensity $\hat{L}$. These resulting intensity-aware emotion features $M^r(f_e)$ then serve as guiding signals for a transformer-based talking-head generation model, enabling the generation of lifelike talking-head videos.
  • Figure 2: We compare our method with several intensity-control methods for audio-driven talking-head generation. The first row shows the intensity predicted by the audio-to-intensity predictor, while the subsequent rows show the results of different methods using that predicted intensity. The intensity of the emotions generated by our method correlates strongly with the target intensity, resulting in more diverse and realistic talking-head results. In contrast, the intensity generated by other methods does not consistently align with the target intensity, as shown by the frames outlined in red.
  • Figure 3: Generalization to unseen emotions and identities.
  • Figure 4: We show a few frames from a real video, the inferred pseudo-intensity, and the corresponding video generated using that inferred intensity. Our pseudo-labeling method accurately captures the emotion intensity from real talking-heads.
  • Figure 5: The detailed architecture of the encoder and decoder of the audio-to-intensity predictor, and of the emotion adaptation network.
  • ...and 2 more figures
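
The norm adjustment described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes that "adjusting the norm" means rescaling the mapped emotion feature $M(f_e)$ so that its L2 norm equals the per-frame intensity $\hat{L}$ while its orientation (the emotion type) is left unchanged; this excerpt does not state the exact rule, so that is an assumption. The function names `rescale_to_intensity` and `frame_wise_features` are hypothetical and not taken from the paper, and plain Python lists stand in for the learned latent features.

```python
import math
from typing import List, Sequence


def l2_norm(vector: Sequence[float]) -> float:
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))


def rescale_to_intensity(emotion_feature: Sequence[float], intensity: float) -> List[float]:
    """Keep the orientation of the feature and set its L2 norm to `intensity`.

    Assumption: intensity is a non-negative scalar, and the norm is set by
    direct rescaling (hypothetical; the source does not give the rule).
    """
    if intensity < 0.0:
        raise ValueError("intensity is assumed to be non-negative")
    norm = l2_norm(emotion_feature)
    if norm == 0.0:
        raise ValueError("a zero vector has no orientation to preserve")
    scale = intensity / norm
    return [x * scale for x in emotion_feature]


def frame_wise_features(
    emotion_feature: Sequence[float], intensities: Sequence[float]
) -> List[List[float]]:
    """Produce one intensity-aware feature per frame from one emotion feature."""
    return [rescale_to_intensity(emotion_feature, level) for level in intensities]


# Toy stand-in for M(f_e) and for three predicted per-frame intensities.
mapped_emotion = [3.0, 4.0]
predicted_intensity = [0.5, 1.0, 2.0]
features = frame_wise_features(mapped_emotion, predicted_intensity)
```

In this sketch every per-frame feature points in the same direction as `mapped_emotion`, so the emotion type is fixed across frames and only the norm, and hence the intensity, varies from frame to frame.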