DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation
Junming Chen, Yunfei Liu, Jianan Wang, Ailing Zeng, Yu Li, Qifeng Chen
TL;DR
DiffSHEG addresses the challenge of jointly generating synchronized 3D facial expressions and body gestures driven by speech. It introduces a diffusion-based framework with a UniEG Transformer that enforces uni-directional information flow from expression to gesture, and Fast Out-Painting-based Partial Autoregressive Sampling (FOPPAS), a sampling method that supports arbitrary-length sequences in real time. The approach achieves state-of-the-art performance on the BEAT and SHOW datasets, validated by quantitative metrics and user studies, and runs at around 31 FPS on a single GPU. By producing realistic, synchronized speech-driven motion for sequences of any length, the work is relevant to digital humans, immersive interfaces, and embodied agents.
Abstract
We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation with arbitrary length. While previous works focused on co-speech gesture or expression generation individually, the joint generation of synchronized expressions and gestures remains barely explored. To address this, our diffusion-based co-speech motion generation transformer enables uni-directional information flow from expression to gesture, facilitating improved matching of joint expression-gesture distributions. Furthermore, we introduce an outpainting-based sampling strategy for arbitrarily long sequence generation in diffusion models, offering flexibility and computational efficiency. Our method provides a practical solution that produces high-quality, synchronized expressions and gestures driven by speech. Evaluated on two public datasets, our approach achieves state-of-the-art performance both quantitatively and qualitatively. Additionally, a user study confirms the superiority of DiffSHEG over prior approaches. By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
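The idea behind outpainting-based sampling for long sequences can be illustrated with a minimal sketch. It is not the authors' FOPPAS implementation; it shows generic outpainting-style clip chaining under simplifying assumptions. Each motion frame is a single float, `toy_denoiser` is a hypothetical placeholder for a trained, audio-conditioned network, and the names `alpha_bar`, `noise_to_level`, `sample_clip`, and `sample_long` are invented for illustration. At every reverse step, the frames that overlap with the previous clip are overwritten with the already-generated motion, noised to the current level, so each new clip continues the previous one.

```python
import math
import random


def alpha_bar(t, num_steps):
    """Signal level of a cosine-style schedule: 1.0 at t=0 (clean data)."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def noise_to_level(clean, t, num_steps, rng):
    """Forward-diffuse clean frames to noise level t."""
    a = alpha_bar(t, num_steps)
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for v in clean]


def toy_denoiser(x, t, num_steps):
    """Hypothetical stand-in for a trained model: one reverse step t -> t-1.

    A real model would predict noise conditioned on speech features;
    this placeholder only damps the sample.
    """
    return [0.9 * v for v in x]


def sample_clip(clip_len, known_prefix, num_steps, rng):
    """Sample one clip whose first frames are constrained to known_prefix."""
    x = [rng.gauss(0.0, 1.0) for _ in range(clip_len)]
    k = len(known_prefix)
    for t in range(num_steps, 0, -1):
        x = toy_denoiser(x, t, num_steps)  # sample is now at level t-1
        # Outpainting: overwrite the known frames with the known motion,
        # noised to the same level as the rest of the sample.
        x[:k] = noise_to_level(known_prefix, t - 1, num_steps, rng)
    return x


def sample_long(total_len, clip_len, overlap, num_steps=20, seed=0):
    """Chain overlapping clips until total_len frames are available."""
    if not 0 < overlap < clip_len:
        raise ValueError("overlap must satisfy 0 < overlap < clip_len")
    rng = random.Random(seed)
    motion = sample_clip(clip_len, [], num_steps, rng)
    while len(motion) < total_len:
        prefix = motion[-overlap:]              # tail of what exists so far
        clip = sample_clip(clip_len, prefix, num_steps, rng)
        motion.extend(clip[overlap:])           # keep only the new frames
    return motion[:total_len]


if __name__ == "__main__":
    motion = sample_long(total_len=100, clip_len=34, overlap=4)
    print(len(motion))
```

Because the noise level reaches zero at the last reverse step, the overlapping frames of each new clip end up identical to the tail of the previous clip, which is what keeps consecutive clips continuous. Sampling the clips strictly one after another, as here, is the simplest variant; the sketch makes no claim about how FOPPAS reduces that cost to reach real-time speed.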
