DiffuseStyleGesture: Stylized Audio-Driven Co-Speech Gesture Generation with Diffusion Models

Sicheng Yang; Zhiyong Wu; Minglei Li; Zhensong Zhang; Lei Hao; Weihong Bao; Ming Cheng; Long Xiao

TL;DR

DiffuseStyleGesture tackles speech-driven co-speech gesture generation by integrating a time-aware diffusion model with cross-local attention and self-attention to align gestures with audio and semantics. It enables explicit style control via classifier-free guidance, allowing interpolation and extrapolation of gesture styles while maintaining speech coherence. Extensive human evaluations show competitive or superior performance against prior methods in human-likeness, gesture-speech alignment, and style appropriateness, with diversity demonstrated through varied seed gestures and noise. The approach advances controllable, high-quality co-speech gesture generation and points to future work on real-time efficiency and broader style representations.

Abstract

Beyond speech, gestures are part of the art of communication. Automatic co-speech gesture generation draws much attention in computer animation. It is a challenging task due to the diversity of gestures and the difficulty of matching the rhythm and semantics of the gesture to the corresponding speech. To address these problems, we present DiffuseStyleGesture, a diffusion-model-based speech-driven gesture generation approach. It generates high-quality, speech-matched, stylized, and diverse co-speech gestures from given speech of arbitrary length. Specifically, we introduce cross-local attention and self-attention to the gesture diffusion pipeline to generate better speech-matched and more realistic gestures. We then train our model with classifier-free guidance to control the gesture style by interpolation or extrapolation. Additionally, we improve the diversity of generated gestures with different initial gestures and noise. Extensive experiments show that our method outperforms recent approaches on speech-driven gesture generation. Our code, pre-trained models, and demos are available at https://github.com/YoungSeng/DiffuseStyleGesture.
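The style control by interpolation or extrapolation mentioned in the abstract can be pictured as a weighted blend of two denoiser predictions made under different conditions (for example, with and without a style, or with two different styles). The following is a minimal Python sketch of that idea, not the authors' implementation: the function name `blend_predictions`, the weight name `gamma`, and the flat-list representation of a prediction are assumptions made for illustration.

```python
def blend_predictions(pred_a, pred_b, gamma):
    """Blend two denoiser outputs elementwise.

    Hypothetical helper: returns gamma * pred_a + (1 - gamma) * pred_b.
    With 0 <= gamma <= 1 the result interpolates between the two
    predictions; with gamma > 1 it extrapolates past pred_a, away
    from pred_b.
    """
    if len(pred_a) != len(pred_b):
        raise ValueError("predictions must have the same length")
    return [gamma * a + (1.0 - gamma) * b for a, b in zip(pred_a, pred_b)]


# Toy values standing in for predictions under two conditions.
styled = [1.0, 2.0]
unstyled = [0.0, 0.0]

halfway = blend_predictions(styled, unstyled, 0.5)      # interpolation
exaggerated = blend_predictions(styled, unstyled, 2.0)  # extrapolation
```

In this toy setting, `gamma = 1.0` reproduces the first prediction unchanged, and larger values push the result further in the direction that distinguishes the first condition from the second.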

Paper Structure (18 sections, 8 equations, 9 figures, 1 table)

Introduction
Related Work
Co-speech Gesture Generation
Diffusion Models for Motion Generation
Our Approach
Diffusion Model for Gesture Generation
Attention-based Speech-driven Gesture Generation Model
Style-controllable Gesture Generation
Experiments
Comparison to Existing Methods
Gesture Controllability
Ablation Studies
Discussion and Conclusion
Overview
Ground Truth Gesture Clip
...and 3 more sections

Figures (9)

Figure 1: Gesture examples generated by our proposed method on various types of speech and styles. All characters used in the paper are publicly available.
Figure 2: (Top) Denoising module of DiffuseStyleGesture. The noising step $t$, the noisy gesture sequence $x_t$ at that step, and the conditions $c$ (seed gesture $d$, style $s$, and audio $a$) are fed into the model. Cross-local attention and self-attention better capture the correlations between speech and gesture based on WavLM features. Random masks in the seed gesture and style feature processing pipeline support classifier-free guidance training of the model and allow interpolation or extrapolation, giving a high degree of control over the generated gestures. (Bottom) Sampling module of DiffuseStyleGesture. At each step $t$, we predict $\hat{x}_0$ with the denoising process based on the corresponding conditions, then add noise to it with the diffusion process to obtain $x_{t-1}$. This process is repeated from $t=T$ until $t=0$.
Figure 3: Different patterns of attention used in our experiments, where (a) and (c) are attention mechanisms used in our model and (b) is a pattern compared in the Ablation Studies section. The rows represent the outputs and the columns represent the inputs. The colored squares highlight the relevant elements for each row of output.
Figure 4: Box plot comparing the MOS of different models along different dimensions. The box extends from the lower quartile (Q1) to the upper quartile (Q3) of the data. The red line denotes the median. The notches represent the 95% confidence interval (CI) around the median. When the CI extends below Q1 or above Q3, the notch extends beyond the box, giving it a distinctive "flipped" appearance. We have also marked the mean and its 95% CI in the figure with a green dashed line and a blue vertical line, respectively.
Figure 5: The t-SNE visualization of gestures with different styles and the shadow maps of the skeletal gesture with the corresponding style. For example, for the 'old' gesture, the waist and knees are more bent, and the hands are mostly on the knees or waist.
...and 4 more figures
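
The sampling procedure described in the Figure 2 caption (predict $\hat{x}_0$ at each step, then diffuse it back to the previous noising step, repeating from $t=T$ down) can be sketched as follows. This is a hedged illustration only, not the released code: the linear beta schedule, the function names `make_alpha_bars`, `diffuse`, and `sample`, and the `denoiser(x_t, t, cond)` callable interface are all assumptions, and gestures are represented as flat lists of floats for simplicity.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an ASSUMED linear beta schedule.

    alpha_bars[i] corresponds to noising step t = i + 1.
    """
    alpha_bars = []
    prod = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def diffuse(x0, alpha_bar, rng):
    """Forward-diffuse a clean sample: sqrt(ab) * x0 + sqrt(1 - ab) * noise."""
    scale = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


def sample(denoiser, cond, dim, num_steps, seed=0):
    """Iteratively predict x0_hat and re-noise it to the previous step.

    `denoiser` is a hypothetical callable (x_t, t, cond) -> x0_hat.
    """
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    x_t = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # x_T: pure noise
    for t in range(num_steps, 0, -1):
        x0_hat = denoiser(x_t, t, cond)
        if t == 1:
            return x0_hat  # final step: keep the clean prediction
        # Re-noise the prediction to step t - 1 (index t - 2 in alpha_bars).
        x_t = diffuse(x0_hat, alpha_bars[t - 2], rng)
    return x_t


# Toy denoiser that ignores its input and returns the condition itself.
result = sample(lambda x_t, t, cond: list(cond), [0.5, -0.5], dim=2, num_steps=10)
```

With the toy denoiser above, the loop simply returns the condition at the last step; a trained model would instead refine its prediction of the clean gesture sequence as the noise level decreases.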
