Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion

Kiran Chhatre; Radek Daněček; Nikos Athanasiou; Giorgio Becherini; Christopher Peters; Michael J. Black; Timo Bolkart

TL;DR

AMUSE tackles emotional speech-driven 3D body animation by explicitly disentangling speech into content, emotion, and style latent vectors and conditioning a latent diffusion model on them. It couples a speech disentanglement module with a SMPL-X-based motion prior and a latent denoiser to generate gestures that are synchronized with the speech and reflect the intended emotion. Because the three factors are separated, the content of one speech sequence can be combined with the emotion and style of another, which enables emotion editing and gesture style transfer across speakers. Quantitative and perceptual evaluations show that AMUSE achieves state-of-the-art performance in beat alignment, gesture diversity, emotion classification accuracy, and semantic relevance, and qualitative results show realistic, emotion-consistent gestures. The framework supports controllable, diverse, and natural 3D gesture synthesis for applications in AR/VR, games, and virtual assistants, with potential extensions to full-body locomotion and integrated facial expressions.

Abstract

Existing methods for synthesizing 3D human gestures from speech have shown promising results, but they do not explicitly model the impact of emotions on the generated gestures. Instead, these methods directly output animations from speech without control over the expressed emotion. To address this limitation, we present AMUSE, an emotional speech-driven body animation model based on latent diffusion. Our observation is that content (i.e., gestures related to speech rhythm and word utterances), emotion, and personal style are separable. To account for this, AMUSE maps the driving audio to three disentangled latent vectors: one for content, one for emotion, and one for personal style. A latent diffusion model, trained to generate gesture motion sequences, is then conditioned on these latent vectors. Once trained, AMUSE synthesizes 3D human gestures directly from speech with control over the expressed emotions and style by combining the content from the driving speech with the emotion and style of another speech sequence. Randomly sampling the noise of the diffusion model further generates variations of the gesture with the same emotional expressivity. Qualitative, quantitative, and perceptual evaluations demonstrate that AMUSE outputs realistic gesture sequences. Compared to the state of the art, the generated gestures are better synchronized with the speech content, and better represent the emotion expressed by the input speech. Our code is available at amuse.is.tue.mpg.de.
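
The recombination described in the abstract (content from the driving speech, emotion and style from another sequence, plus resampled diffusion noise for variation) can be illustrated with a minimal sketch. Everything named here is hypothetical: `encode_speech` is a toy random projection, not the AMUSE disentanglement network, `LATENT_DIM` is an arbitrary size, and joining the three latents by concatenation is an assumption made only for illustration.

```python
import numpy as np

LATENT_DIM = 8  # hypothetical latent size, not taken from the paper


def encode_speech(audio: np.ndarray, seed: int = 0) -> dict:
    """Toy stand-in for a speech disentanglement encoder (NOT the AMUSE model).

    Mean-pools the audio features over time and applies three fixed random
    projections, one per factor: content, emotion, style.
    """
    rng = np.random.default_rng(seed)  # fixed seed: same toy projections on every call
    pooled = audio.mean(axis=0)        # shape: (feature_dim,)
    names = ("content", "emotion", "style")
    return {n: rng.standard_normal((LATENT_DIM, pooled.shape[0])) @ pooled for n in names}


def build_condition(driving: dict, reference: dict) -> np.ndarray:
    """Content from the driving speech, emotion and style from the reference.

    Concatenation is an assumption of this sketch.
    """
    return np.concatenate([driving["content"], reference["emotion"], reference["style"]])


def sample_initial_noise(shape, seed: int) -> np.ndarray:
    """Different seeds give different starting noise, hence different gesture variations."""
    return np.random.default_rng(seed).standard_normal(shape)


# Two synthetic "audio" feature sequences of shape (frames, feature_dim).
driving_audio = np.random.default_rng(1).standard_normal((100, 16))
reference_audio = np.random.default_rng(2).standard_normal((120, 16))

drv = encode_speech(driving_audio)
ref = encode_speech(reference_audio)

cond = build_condition(drv, ref)                        # conditioning for the denoiser
noise_a = sample_initial_noise((LATENT_DIM,), seed=10)  # variation A
noise_b = sample_initial_noise((LATENT_DIM,), seed=11)  # variation B
```

With the same `cond` and two different noise samples, a trained denoiser would produce two gesture variations that share content, emotion, and style conditioning; the denoiser itself is omitted here.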

Paper Structure (73 sections, 15 equations, 20 figures, 6 tables)

This paper contains 73 sections, 15 equations, 20 figures, 6 tables.

Introduction
Related Work
3D Conditional Human Motion Generation
Gesture Generation from Speech
Rule-based gesture synthesis.
Data-driven gesture synthesis.
Emotion Control
Method
Preliminary: Expressive 3D Body Model
Speech Disentanglement Model
Architecture.
Training.
Gesture Generation Model
Motion prior.
Diffusion process.
...and 58 more sections

Figures (20)

Figure 1: Goal. AMUSE generates realistic emotional 3D body gestures directly from a speech sequence (top). It provides user control over the generated emotion by combining the driving speech sequence with a different emotional audio (bottom).
Figure 2: Training. We train the motion prior ($\mathcal{P}_{E},\mathcal{P}_{D}$) and the latent denoiser $\Delta$ jointly, while keeping the audio encoding networks frozen. The forward pass takes an input audio sequence $a^{1:T}$ and a pose sequence $m^{1:T}$. First, we pass $m^{1:T}$ through $\mathcal{P}_{E}$ and $\mathcal{P}_{D}$ and compute $\mathcal{L}_{rec}$, $\mathcal{L}_{Vrec}$, and $\mathcal{L}_{KL}$. Then, we apply the diffusion process to the gradient-detached latent $\textup{sg}\left[z_m\right]$ to obtain the noisy $z_m^{(D)}$, denoise it with $\Delta$, and compute $\mathcal{L}_{LD}$. Finally, we use $\Delta$ to fully denoise $z_n$ into the gradient-detached $\textup{sg}\left[z_{\tilde{m}}\right]$, decode $\tilde{m}^{1:T}$ with $\mathcal{P}_{D}$, and compute $\mathcal{L}_{align}$ and $\mathcal{L}_{Valign}$.
Figure 3: Qualitative comparison across all emotions. We evaluate generation on different test audios. AMUSE exhibits well-synchronized beat gestures and consistently produces gestures that accurately convey the emotional content expressed in the input speech.
Figure 4: Qualitative comparison with baseline methods. The speech segment contains intensely angry speech.
Figure 5: Qualitative evaluation of diverse generations. Multiple generations are overlaid.
...and 15 more figures
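
The noising step mentioned in the Figure 2 caption (applying the diffusion process to a motion latent to obtain a noisy latent) can be sketched with the standard DDPM forward process. This is a generic illustration, not the paper's configuration: the linear beta schedule, the number of steps, and the latent size are all assumptions, and `make_alpha_bar` and `q_sample` are hypothetical helper names.

```python
import numpy as np


def make_alpha_bar(num_steps: int = 1000,
                   beta_start: float = 1e-4,
                   beta_end: float = 2e-2) -> np.ndarray:
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def q_sample(z0: np.ndarray, t: int, alpha_bar: np.ndarray, rng: np.random.Generator):
    """Standard DDPM forward noising:

        z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)

    Returns the noisy latent and the noise that was added.
    """
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return z_t, eps


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Toy motion latent standing in for the encoder output z_m. NumPy arrays carry
# no gradients, so the stop-gradient sg[.] from the caption has no counterpart here.
z_m = rng.standard_normal(8)

z_early, _ = q_sample(z_m, t=0, alpha_bar=alpha_bar, rng=rng)    # barely perturbed
z_late, eps_late = q_sample(z_m, t=999, alpha_bar=alpha_bar, rng=rng)  # almost pure noise
```

At small `t` the noisy latent stays close to `z_m`; at the final step it is dominated by the sampled noise, which is the regime the denoiser starts from at inference time.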
