Audio-driven Gesture Generation via Deviation Feature in the Latent Space

Jiahui Chen, Yang Huan, Runhua Shi, Chanfan Ding, Xiaoqi Mo, Siyu Xiong, Yinong He

TL;DR

This work addresses the challenge of generating realistic co-speech gesture videos. It introduces a weakly supervised framework that models pixel-level motion deviations in latent space and integrates them through a diffusion-based motion generator. The method comprises a deviation module (latent deviation extractor, warping calculator, latent deviation decoder) and a two-stage training regime: Stage 1 learns base motion representations, and Stage 2 uses a latent motion diffusion model with temporal priors to stabilize the animation. The key contributions are the latent deviation module, a weakly supervised learning strategy for the deviation representation, and a diffusion-based pipeline that improves gesture fidelity, timing synchronization, and appearance detail in the hands, lips, and face. Evaluations on the PATS dataset show consistent gains over prior state-of-the-art methods on objective metrics (Fréchet Gesture Distance (FGD), Fréchet Video Distance (FVD), Diversity, and Beat Alignment Score (BAS)) and in subjective user studies, suggesting practical value for audio-driven gesture synthesis in realistic avatars and video production.
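
Of the objective metrics listed, FGD and FVD are Fréchet-style distances: each fits a Gaussian to features of real and generated samples and compares the two fits. The sketch below shows that distance computation only. The function name `frechet_distance` is hypothetical, and the feature extractor (a pretrained gesture or video encoder) is assumed to exist elsewhere; this summary does not specify which encoders the authors use.

```python
import numpy as np


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    Both inputs have shape (num_samples, feature_dim). The features are
    assumed to come from some pretrained encoder (not shown here).
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Tr(sqrt(cov_r @ cov_g)) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)
```

Lower values mean the generated feature distribution is closer to the real one. Shifting every feature vector by a constant offset leaves the covariances unchanged, so the distance reduces to the squared norm of that offset.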

Abstract

Gestures are essential for enhancing co-speech communication, offering visual emphasis and complementing verbal interactions. While prior work has concentrated on point-level motion or fully supervised data-driven methods, we focus on co-speech gestures, advocating for weakly supervised learning and pixel-level motion deviations. We introduce a weakly supervised framework that learns latent representation deviations, tailored for co-speech gesture video generation. Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation. By leveraging weakly supervised deviations in latent space, we effectively generate hand gestures and mouth movements, crucial for realistic video production. Experiments show our method significantly improves video quality, surpassing current state-of-the-art techniques.

Paper Structure

This paper contains 18 sections, 4 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Examples of our generated gesture videos. White dashed arrows indicate gestures corresponding to bold words. The red dotted boxes indicate the mouth shapes corresponding to the italicized words.
  • Figure 2: The co-speech gesture video generation pipeline of our proposed method consists of three main components: 1) the latent deviation extractor (yellow) extracts motion features from videos and predicts optical flow; 2) the latent deviation decoder (blue) applies the deviation to the motion optical flow and decodes the image features to reconstruct the image; 3) the latent motion diffusion (green) generates motion features based on the given speech.
  • Figure 3: The deviation in latent representation. Heatmap shows the deviation of gestures and other movements in the latent space.
  • Figure 4: Visual comparison with SOTA methods. Our method generates gestures with more extensive and accurate motions (dashed boxes) that match the audio and semantics. Red boxes indicate unrealistic gestures generated by ANGIE [liu2022audio], S2G-MDDiffusion [he2024co], and TANGO [liu2024tango].
  • Figure 5: Comparison of drivers of the same and different people. The gesture videos we generate are not only better in gesture expression, but also more natural in facial expressions and lip movements.
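
As a rough illustration of the step in Figure 2 where a deviation is applied to the motion optical flow before decoding, the sketch below warps a feature map with a flow field plus a deviation field. It rests on several assumptions that this summary does not confirm: the deviation is combined with the flow by simple addition, warping is backward sampling with nearest-neighbor lookup, and the name `warp_with_deviation` is hypothetical. The authors' warping calculator may differ on all three points.

```python
import numpy as np


def warp_with_deviation(feat: np.ndarray, flow: np.ndarray,
                        deviation: np.ndarray) -> np.ndarray:
    """Backward-warp a feature map by (flow + deviation).

    feat:      (H, W, C) feature map.
    flow:      (H, W, 2) coarse motion as (dx, dy) offsets in pixels.
    deviation: (H, W, 2) fine correction, added to the flow (an assumption).
    """
    h, w = feat.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    total = flow + deviation
    # Each output location samples the input at its offset position,
    # rounded to the nearest pixel and clamped to the image border.
    src_x = np.clip(np.rint(xs + total[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + total[..., 1]).astype(int), 0, h - 1)
    return feat[src_y, src_x]
```

A practical model would more likely use differentiable bilinear sampling so that gradients can reach the flow and deviation predictors; nearest-neighbor lookup is used here only to keep the example short.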