Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers
Yasheng Sun, Zhiliang Xu, Hang Zhou, Jiazhi Guan, Quanwei Yang, Kaisiyuan Wang, Borong Liang, Yingying Li, Haocheng Feng, Jingdong Wang, Ziwei Liu, Koike Hideki
TL;DR
Cosh-DiT synthesizes co-speech gesture videos that stay synchronized with speech while maintaining a photorealistic appearance. It introduces a two-stage diffusion framework: an audio-driven gesture diffusion transformer (Cosh-DiT-A) that uses discrete diffusion to convert speech into a hybrid gesture representation, and a video diffusion transformer (Cosh-DiT-V) that uses continuous diffusion to render lifelike video conditioned on the generated motion. Gestures are modeled in a VQ-VAE-based discrete latent space that covers upper-body poses and 3D hand meshes. A Geometric-Aware Alignment module keeps the hand and wrist projections accurate, and stacked iterative DiT blocks fuse appearance, motion history, and gesture guidance. Quantitative and qualitative results show that Cosh-DiT achieves better image quality, temporal coherence, and hand and facial detail than state-of-the-art baselines, which suggests it is suitable for realistic co-speech avatar animation and related applications.
Abstract
Co-speech gesture video synthesis is a challenging task that requires both probabilistic modeling of human gestures and the synthesis of realistic images that align with the rhythmic nuances of speech. To address these challenges, we propose Cosh-DiT, a Co-speech gesture video system with hybrid Diffusion Transformers that perform audio-to-motion and motion-to-video synthesis using discrete and continuous diffusion modeling, respectively. First, we introduce an audio Diffusion Transformer (Cosh-DiT-A) to synthesize expressive gesture dynamics synchronized with speech rhythms. To capture upper body, facial, and hand movement priors, we employ vector-quantized variational autoencoders (VQ-VAEs) to jointly learn their dependencies within a discrete latent space. Then, for realistic video synthesis conditioned on the generated speech-driven motion, we design a visual Diffusion Transformer (Cosh-DiT-V) that effectively integrates spatial and temporal contexts. Extensive experiments demonstrate that our framework consistently generates lifelike videos with expressive facial expressions and natural, smooth gestures that align seamlessly with speech.
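The discrete latent space mentioned above rests on vector quantization: each continuous encoder output is replaced by the index of its nearest codebook entry, so a motion sequence becomes a sequence of tokens. The following is a minimal sketch of that lookup step only, not the authors' implementation. The function name `quantize`, the 2-D codebook, and the frame values are all hypothetical and chosen for illustration; a real VQ-VAE learns its codebook jointly with the encoder and decoder.

```python
from math import dist

def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry nearest to `vector`."""
    # Nearest neighbour under Euclidean distance.
    index = min(range(len(codebook)), key=lambda i: dist(vector, codebook[i]))
    return index, codebook[index]

# Hypothetical 2-D codebook with four entries (assumption: real codebooks
# are learned and far higher-dimensional).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical encoder outputs for three consecutive motion frames.
frames = [(0.1, -0.2), (0.9, 0.2), (0.8, 0.9)]

# Each frame is mapped to a discrete token (its codebook index).
tokens = [quantize(frame, codebook)[0] for frame in frames]
print(tokens)
```

A discrete diffusion model such as the one described for Cosh-DiT-A would then operate on token sequences like `tokens` rather than on raw continuous pose values.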
