EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation

Xingqun Qi, Chen Liu, Lincheng Li, Jie Hou, Haoran Xin, Xin Yu

TL;DR

EmotionGesture tackles audio-driven emotional co-speech gesture generation by integrating an Emotion-Beat Mining module to extract emotion and rhythm cues, a Spatial-Temporal Prompter to produce smooth pose prompts from initial frames, and a transformer-based generator guided by an emotion-conditioned VAE. It introduces a beat-alignment loss $L_{beat}$ and a motion-smooth loss $L_{smooth}$ to ensure rhythm coherence and temporal stability, while enabling diverse emotional expressions. The approach is validated on BEAT and a newly collected TED Emotion dataset, achieving state-of-the-art metrics across $L2$, $MPJRE$, $FGD$, $BA$, $EA$, and Diversity, with qualitative user studies confirming improvements in naturalness and synchrony. The work provides a practical pipeline for realistic, emotion-aware 3D co-speech gestures and releases data and code to support further research.

Abstract

Generating vivid and diverse 3D co-speech gestures is crucial for various applications in animating virtual avatars. While most existing methods can generate gestures from audio directly, they usually overlook that emotion is one of the key factors of authentic co-speech gesture generation. In this work, we propose EmotionGesture, a novel framework for synthesizing vivid and diverse emotional co-speech 3D gestures from audio. Considering emotion is often entangled with the rhythmic beat in speech audio, we first develop an Emotion-Beat Mining module (EBM) to extract the emotion and audio beat features as well as model their correlation via a transcript-based visual-rhythm alignment. Then, we propose an initial pose based Spatial-Temporal Prompter (STP) to generate future gestures from the given initial poses. STP effectively models the spatial-temporal correlations between the initial poses and the future gestures, thus producing the spatial-temporal coherent pose prompt. Once we obtain pose prompts, emotion, and audio beat features, we will generate 3D co-speech gestures through a transformer architecture. However, considering the poses of existing datasets often contain jittering effects, this would lead to generating unstable gestures. To address this issue, we propose an effective objective function, dubbed Motion-Smooth Loss. Specifically, we model motion offset to compensate for jittering ground-truth by forcing gestures to be smooth. Last, we present an emotion-conditioned VAE to sample emotion features, enabling us to generate diverse emotional results. Extensive experiments demonstrate that our framework outperforms the state-of-the-art, achieving vivid and diverse emotional co-speech 3D gestures. Our code and dataset will be released at the project page: https://xingqunqi-lab.github.io/Emotion-Gesture-Web/
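
The abstract motivates the Motion-Smooth Loss by jitter in the ground-truth poses but does not give its formula here. As a rough illustration of the general idea only, the sketch below penalizes second-order frame-to-frame differences of a pose sequence. The function names and the flattened-coordinate pose layout are assumptions made for this example; this is a generic smoothness penalty, not the paper's loss.

```python
from typing import List, Sequence

Pose = Sequence[float]  # assumed layout: one frame of flattened joint coordinates


def frame_offsets(poses: Sequence[Pose]) -> List[List[float]]:
    """First-order differences between consecutive frames: poses[t+1] - poses[t]."""
    return [[b - a for a, b in zip(p0, p1)] for p0, p1 in zip(poses, poses[1:])]


def smoothness_penalty(poses: Sequence[Pose]) -> float:
    """Mean squared second-order difference; returns 0.0 for fewer than 3 frames.

    Generic stand-in for a smoothness objective, not the paper's Motion-Smooth Loss.
    """
    velocity = frame_offsets(poses)         # motion offsets between frames
    acceleration = frame_offsets(velocity)  # change of those offsets over time
    terms = [x * x for frame in acceleration for x in frame]
    return sum(terms) / len(terms) if terms else 0.0


# Toy 1-D sequences: constant velocity versus a stop-and-go (jittery) motion.
smooth = [[0.0], [1.0], [2.0], [3.0]]
jittery = [[0.0], [1.5], [1.5], [3.0]]
print(smoothness_penalty(smooth), smoothness_penalty(jittery))
```

Both toy sequences start and end at the same pose, so the penalty separates them purely by how uneven the motion offsets are.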

Paper Structure (29 sections, 9 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Diverse emotional exemplary clips sampled by our EmotionGesture. Pause durations in the transcript are padded as $\left( \diamond \right)$. We identify the beat via the frame-wise aligned uttered words (pink) in the audio-synchronized transcripts, because in a noisy environment it is unreliable to use audio onsets (blue vertical lines in the audio signal) directly as rhythmic indicators.
  • Figure 2: Overview of our proposed EmotionGesture framework. With the extracted audio beat features $F^{B}$ and emotion features $F^{E}$, the framework generates audio-driven, diverse emotional co-speech gestures. The spatial-temporal prompter produces a temporally coherent pose prompt from the initial pose sequence.
  • Figure 3: Details of our proposed emotion sampling component and spatial-temporal prompter. (a) In the emotion sampling process, we introduce an emotion-conditioned VAE to obtain diverse emotion embeddings $F^{E}$. The emotion code is a one-hot vector representing the emotion category. (b) The Spatial-Temporal Prompter provides a temporally coherent pose prompt to guide gesture generation.
  • Figure 4: Visualization of our predicted 3D hand gestures against various state-of-the-art methods [yoon2019robots, ginosar2019learning, ahuja2019language2pose, liu2022beat, yoon2020speech, liu2022learning, zhu2023taming, yi2022generating]. From top to bottom, we show three key frames (an early, a middle, and a late one) of a pose sequence. Best viewed on screen.
  • Figure 5: Visual comparison for the ablation study. From top to bottom, we show four key frames (an early, two middle, and a late one) of a generated pose sequence. Best viewed on screen.
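
The Figure 3(a) caption describes sampling emotion embeddings $F^{E}$ from an emotion-conditioned VAE keyed by a one-hot emotion code. A minimal sketch of that sampling step follows, using the standard reparameterization `z = mu + sigma * eps`. The label set, the embedding size, and the `PRIOR` lookup table are hypothetical stand-ins for this example; in a trained model the mean and log-variance would come from learned networks, and the paper's actual architecture is not specified in this section.

```python
import math
import random
from typing import Dict, List, Tuple

EMOTIONS = ["neutral", "happy", "angry"]  # hypothetical label set for illustration


def one_hot(label: str) -> List[float]:
    """Encode an emotion label as a one-hot emotion code."""
    return [1.0 if name == label else 0.0 for name in EMOTIONS]


# Hypothetical per-emotion (mean, log-variance) pairs standing in for the
# outputs of a learned conditional network.
PRIOR: Dict[str, Tuple[List[float], List[float]]] = {
    "neutral": ([0.0, 0.0], [-2.0, -2.0]),
    "happy": ([1.0, 0.5], [-1.0, -1.0]),
    "angry": ([-1.0, 0.8], [-1.0, -0.5]),
}


def sample_emotion_embedding(code: List[float], rng: random.Random) -> List[float]:
    """Draw z = mu + exp(0.5 * logvar) * eps with eps ~ N(0, 1), conditioned on the code."""
    label = EMOTIONS[code.index(1.0)]
    mu, logvar = PRIOR[label]
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


rng = random.Random(0)
code = one_hot("happy")
# Two draws under the same emotion code give two different embeddings.
first = sample_emotion_embedding(code, rng)
second = sample_emotion_embedding(code, rng)
print(first, second)
```

Repeated draws under one emotion code are what would let a generator produce several distinct gestures for the same audio and emotion.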