Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation

Xingqun Qi; Jiahao Pan; Peng Li; Ruibin Yuan; Xiaowei Chi; Mengfei Li; Wenhan Luo; Wei Xue; Shanghang Zhang; Qifeng Liu; Yike Guo

TL;DR

This work proposes a novel weakly supervised training strategy to encourage authentic gesture transitions and devises an emotion mixture mechanism that provides weak supervision for transition gestures through a learnable mixed emotion label.

Abstract

Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation in human-machine interaction applications. While existing methods can generate gestures that follow a single emotion label, they overlook that modeling long gesture sequences with emotion transitions is more practical in real scenes. In addition, the lack of large-scale datasets with emotion transition speech and corresponding 3D human gestures limits progress on this task. To address it, we first combine ChatGPT-4 with an audio inpainting approach to construct high-fidelity emotion transition human speech. Because obtaining realistic 3D pose annotations for the dynamically inpainted emotion transition audio is extremely difficult, we propose a novel weakly supervised training strategy to encourage authentic gesture transitions. Specifically, to enhance the coordination of transition gestures with respect to the gestures of the two different emotions, we model the temporal association representation between the two emotional gesture sequences as style guidance and infuse it into the transition generation. We further devise an emotion mixture mechanism that provides weak supervision for transition gestures based on a learnable mixed emotion label. Finally, we present a keyframe sampler that supplies effective initial posture cues in long sequences, enabling us to generate diverse gestures. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines built by adapting single-emotion-conditioned counterparts to our newly defined emotion transition task and datasets. Our code and dataset will be released on the project page: https://xingqunqi-lab.github.io/Emo-Transition-Gesture/.
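
As a rough illustration of what weak supervision from a mixed emotion label could look like, the sketch below blends the one-hot labels of the two emotions with a single scalar weight and scores a predicted emotion distribution against the blend. It is not the paper's formulation: the function names, the sigmoid parameterization of the weight, and the soft cross-entropy loss are all assumptions made for this example. In training, the scalar `mix_logit` would be optimized jointly with the generator.

```python
import math


def one_hot(index, num_classes):
    """Return a one-hot list for the given emotion class index."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mixed_emotion_label(head_idx, tail_idx, mix_logit, num_classes):
    """Blend two one-hot emotion labels into one soft label.

    mix_logit is a hypothetical learnable scalar; its sigmoid is the
    weight given to the tail emotion (assumption, not from the paper).
    """
    w = sigmoid(mix_logit)
    head = one_hot(head_idx, num_classes)
    tail = one_hot(tail_idx, num_classes)
    return [(1.0 - w) * h + w * t for h, t in zip(head, tail)]


def soft_cross_entropy(pred_probs, soft_label, eps=1e-12):
    """Cross-entropy between predicted probabilities and a soft label."""
    return -sum(q * math.log(p + eps) for p, q in zip(pred_probs, soft_label))


# Example: 4 emotion classes, transition from class 0 to class 2.
label = mixed_emotion_label(head_idx=0, tail_idx=2, mix_logit=0.0, num_classes=4)
loss = soft_cross_entropy([0.5, 0.0, 0.5, 0.0], label)
```

With `mix_logit = 0.0` the two emotions are weighted equally, so the soft label places half of its mass on each. A classifier applied to the generated transition gestures could then be penalized with `soft_cross_entropy` even though no ground-truth transition poses exist.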

Paper Structure (31 sections, 10 equations, 9 figures, 5 tables)

Introduction
Related Work
Co-speech Gesture Synthesis
3D Human Motion Modeling
Proposed Method
Emotion Transition Dataset Construction
Problem Formulation
Weakly-supervised Emotion Transition
Objective Functions
Experiments
Datasets and Experimental Setting
Quantitative Evaluation
Qualitative Evaluation
Conclusion
Overview
...and 16 more sections

Figures (9)

Figure 1: Diverse example clips sampled by our method from our newly collected BEAT Emotion Transition Dataset. Key frames are visualized to show that the upper-body gestures change in sync with the emotion transition of the human speech. From top to bottom: the input speech audio, the corresponding transcript, and two sampled clips. Best viewed on screen.
Figure 2: Overview of our proposed method. The middle part (blue) shows the overall pipeline for 3D co-speech gesture generation from emotion transition human speech. The left part (green) depicts our proposed Motion Transition Infusion Mechanism, which enhances the coordination of transition gestures with respect to the head and tail gestures. The right part (orange) shows the Emotion Mixture Strategy, which provides weak supervision for the generated transition gestures and thereby encourages authentic results.
Figure 3: Visualization of our generated 3D co-speech gestures compared with various state-of-the-art methods. The samples on the left are from our newly collected TED-ETrans dataset, and the samples on the right are from our BEAT-ETrans dataset. Best viewed on screen.
Figure 4: User study on gesture naturalness, motion smoothness, and speech-gesture coherency.
Figure 5: The dataset construction pipeline. Head and tail audio clips and their corresponding transcripts are fed into the pipeline to generate a smooth, high-quality transition.
...and 4 more figures
