Audio-driven Gesture Generation via Deviation Feature in the Latent Space
Jiahui Chen, Yang Huan, Runhua Shi, Chanfan Ding, Xiaoqi Mo, Siyu Xiong, Yinong He
TL;DR
This work addresses realistic co-speech gesture video generation with a weakly supervised framework that models pixel-level motion deviations in latent space and integrates them through a diffusion-based motion generator. The method comprises a deviation module (latent deviation extractor, warping calculator, and latent deviation decoder) and a two-stage training regime: Stage 1 learns base motion representations, and Stage 2 uses a latent motion diffusion model with temporal priors to stabilize the animation. The key contributions are the latent deviation module, a weakly supervised learning strategy for the deviation representation, and a diffusion-based pipeline that improves gesture fidelity, timing synchronization, and appearance detail in the hands, lips, and face. On the PATS dataset, the method shows consistent gains over prior state-of-the-art methods on objective metrics (FGD, FVD, Diversity, BAS) and in subjective user studies, which suggests practical value for audio-driven gesture synthesis in realistic avatars and video production.
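The summary above names a warping calculator and a latent deviation decoder but does not specify how they operate. As a purely illustrative sketch, the following NumPy code shows one common way such a step could look: a latent feature map is backward-warped by a dense flow field, and a residual "deviation" is added on top. The function names (`warp_latent`, `apply_deviation`), the tensor shapes, and the nearest-neighbour sampling are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def warp_latent(feat, flow):
    """Backward-warp a latent feature map with a dense flow field.

    feat: (C, H, W) latent features (hypothetical layout).
    flow: (2, H, W) per-location offsets (dy, dx), in latent pixels.
    Uses nearest-neighbour sampling and clamps at the borders.
    """
    _, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # For each output location, look up where to sample in the source map.
    src_y = np.clip(np.rint(ys + flow[0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[1]).astype(int), 0, w - 1)
    return feat[:, src_y, src_x]


def apply_deviation(feat, flow, deviation):
    """Coarse motion via warping, then a fine residual correction.

    deviation: (C, H, W) residual, standing in for the output of a
    learned deviation decoder (assumed, not the paper's actual module).
    """
    return warp_latent(feat, flow) + deviation


# Toy usage: shift a 2-channel 3x4 latent map by one column.
feat = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
flow = np.zeros((2, 3, 4))
flow[1] = 1.0  # every location samples from its right-hand neighbour
out = apply_deviation(feat, flow, np.zeros_like(feat))
```

In this toy setup the flow carries the coarse motion and the residual is free to correct fine detail (for example around hands or lips). A learned model would predict both from audio and reference frames rather than taking them as inputs.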
Abstract
Gestures are essential to co-speech communication: they add visual emphasis and complement what is said. Prior work has concentrated on point-level motion or on fully supervised, data-driven methods; we instead advocate weakly supervised learning of pixel-level motion deviations for co-speech gestures. We introduce a weakly supervised framework that learns deviations in the latent representation, tailored to co-speech gesture video generation. Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation. By leveraging weakly supervised deviations in latent space, we generate the hand gestures and mouth movements that are crucial for realistic video production. Experiments show that our method significantly improves video quality and surpasses current state-of-the-art techniques.
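Two of the objective metrics named in the TL;DR, FGD and FVD, are Fréchet distances between Gaussian fits of feature sets extracted from real and generated samples. The sketch below computes that distance with NumPy for two feature matrices. The feature extractor itself (a pretrained gesture or video encoder) is not shown, and the function name `frechet_distance` and its input layout are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussian fits of two feature sets.

    feats_real, feats_gen: arrays of shape (N, D) with D >= 2, one row
    per sample (assumed layout; features come from some pretrained encoder).

    d^2 = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((S_r S_g)^{1/2}) equals the sum of square roots of the eigenvalues
    # of S_r S_g, which are real and non-negative for covariance matrices;
    # clip tiny negative values caused by floating-point error.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


# Toy usage with random stand-in features.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
shifted = real + np.array([1.0, 0.0, 2.0, 0.0])
same = frechet_distance(real, real)        # close to 0
moved = frechet_distance(real, shifted)    # close to 1^2 + 2^2 = 5
```

Lower values indicate that the generated feature distribution is closer to the real one. Shifting every sample by a constant vector leaves the covariance unchanged, so only the squared mean difference contributes, which is why the shifted example is close to 5.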
