DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation

Junming Chen, Yunfei Liu, Jianan Wang, Ailing Zeng, Yu Li, Qifeng Chen

TL;DR

DiffSHEG addresses the challenge of jointly generating synchronized 3D facial expressions and body gestures driven by speech. It introduces a diffusion-based framework with a UniEG Transformer that enforces uni-directional information flow from expression to gesture, and a fast outpainting-based sampling method (FOPPAS) to support arbitrary-length sequences in real time. The approach achieves state-of-the-art performance on BEAT and SHOW, validated by quantitative metrics and user studies, and runs at around 31 FPS on a single GPU. This work advances digital humans by enabling realistic, synchronized, and scalable speech-driven motion for immersive interfaces and embodied agents.

Abstract

We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation with arbitrary length. While previous works focused on co-speech gesture or expression generation individually, the joint generation of synchronized expressions and gestures remains barely explored. To address this, our diffusion-based co-speech motion generation transformer enables uni-directional information flow from expression to gesture, facilitating improved matching of joint expression-gesture distributions. Furthermore, we introduce an outpainting-based sampling strategy for arbitrarily long sequence generation in diffusion models, offering flexibility and computational efficiency. Our method provides a practical solution that produces high-quality synchronized expression and gesture generation driven by speech. Evaluated on two public datasets, our approach achieves state-of-the-art performance both quantitatively and qualitatively. Additionally, a user study confirms the superiority of DiffSHEG over prior approaches. By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
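
The uni-directional information flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' UniEG Transformer: plain linear maps stand in for the transformer blocks, and all names and sizes (`denoise_step`, `W_exp`, `W_ges`, `T`, `D_COND`, `D_EXP`, `D_GES`) are hypothetical. The only property shown is that the gesture branch reads the expression output, while the expression branch never reads gesture information.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_COND, D_EXP, D_GES = 4, 5, 3, 6  # hypothetical frame count and feature sizes

# Linear maps used as stand-ins for the denoising blocks (assumption).
W_exp = rng.standard_normal((D_EXP + D_COND, D_EXP))
W_ges = rng.standard_normal((D_GES + D_COND + D_EXP, D_GES))

def denoise_step(noisy_exp, noisy_ges, cond):
    """One toy denoising step with expression -> gesture flow only."""
    # Expression branch: sees noisy expression and the speech condition.
    exp_out = np.concatenate([noisy_exp, cond], axis=1) @ W_exp
    # Gesture branch: additionally conditioned on the expression output.
    ges_in = np.concatenate([noisy_ges, cond, exp_out], axis=1)
    ges_out = ges_in @ W_ges
    return exp_out, ges_out

cond = rng.standard_normal((T, D_COND))
noisy_exp = rng.standard_normal((T, D_EXP))
noisy_ges = rng.standard_normal((T, D_GES))
exp_out, ges_out = denoise_step(noisy_exp, noisy_ges, cond)
```

In this sketch, perturbing `noisy_ges` leaves `exp_out` unchanged, whereas perturbing `noisy_exp` changes `ges_out`, which is the asymmetry the abstract refers to.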

Paper Structure

This paper contains 24 sections, 8 equations, 12 figures, 4 tables, and 1 algorithm.

Figures (12)

  • Figure 1: DiffSHEG is a unified co-speech expression and gesture generation system based on diffusion models. It captures the joint expression-gesture distribution by enabling the uni-directional information flow from expression to gesture inside the model.
  • Figure 2: DiffSHEG framework overview. Left: Audio Encoders and UniEG-Transformer Generator. Given an audio clip, we encode the audio into a low-level Mel-spectrogram feature and a high-level HuBERT feature. An audio encoder learns a mid-level representation of speech. The audio features are concatenated with other optional temporal conditions and then fed into the UniEG Transformer Denoiser. The denoising block fuses the conditions with the noisy motion at diffusion step t and feeds the result into style-aware transformers to predict the noise. The uni-directional condition flow is enforced from expression to gesture for joint distribution learning. Right: The detailed architecture of the style-aware Transformer encoder and the motion-condition fusion residual block.
  • Figure 3: Illustration of outpainting-based inference for arbitrarily long sequences. Given a previous clip, we generate the current clip by outpainting the remaining frames (light blue) conditioned on the overlapping frames (deep blue). Each row of blue bars represents a motion clip from a single sampling process.
  • Figure 4: Qualitative Comparison on the BEAT [beat] Dataset. In comparison to baseline methods, our approach generates a broader range of natural, agile, and diverse gestures that are closely synchronized with the audio input. When saying "journalist", the character driven by our motion raises both hands to stress the word; when saying "never", our motion moves the right hand and fingers up and down twice, corresponding to the two syllables "ne" and "ver". The character is from MetaHuman [metahuman], rendered by Unreal Engine 5 [ue5].
  • Figure 5: Motion Comparison on the SHOW [talkshow] Dataset. Our method generates more expressive and diverse motions than TalkShow [talkshow] and LS3DCG [habibie2021learning] in terms of both gesture and head pose diversity. Our results also show more agile motions than the baselines.
  • ...and 7 more figures
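
The clip-by-clip outpainting idea in the Figure 3 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's FOPPAS sampler: `sample_clip` uses a placeholder update in place of a trained denoiser, and the names `sample_clip`, `generate_long_sequence`, `clip_len`, and `overlap` are hypothetical. It only shows how each new clip is tied to the overlapping frames of the previous clip and how the newly generated frames are appended.

```python
import numpy as np

def sample_clip(clip_len, dim, known_prefix=None, steps=10, rng=None):
    """Toy stand-in for a diffusion sampler (placeholder, no trained model).

    At every step the overlapping frames are overwritten with the known
    frames from the previous clip plus noise that shrinks to zero, so only
    the remaining frames are effectively generated (outpainted).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.standard_normal((clip_len, dim))
    for step in range(steps, 0, -1):
        x = 0.9 * x  # placeholder for a real denoising update
        if known_prefix is not None:
            k = len(known_prefix)
            noise_scale = (step - 1) / steps  # reaches 0 at the final step
            x[:k] = known_prefix + noise_scale * rng.standard_normal((k, dim))
    return x

def generate_long_sequence(total_frames, clip_len=8, overlap=2, dim=3, seed=0):
    """Chain clips until total_frames are produced (assumes overlap <= clip_len - overlap)."""
    rng = np.random.default_rng(seed)
    clips = [sample_clip(clip_len, dim, rng=rng)]
    n = clip_len
    while n < total_frames:
        prefix = clips[-1][-overlap:]  # overlapping frames from the previous clip
        clip = sample_clip(clip_len, dim, known_prefix=prefix, rng=rng)
        clips.append(clip[overlap:])   # keep only the newly outpainted frames
        n += clip_len - overlap
    return np.concatenate(clips)[:total_frames]

motion = generate_long_sequence(20)
```

Because each clip needs only the last few frames of its predecessor, the sequence length is not bounded by the clip length seen during training, which is the flexibility the caption describes.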