HoloGest: Decoupled Diffusion and Motion Priors for Generating Holisticly Expressive Co-speech Gestures

Yongkang Cheng, Shaoli Huang

TL;DR

HoloGest addresses the challenge of generating holistic co-speech gestures by decoupling diffusion priors for global motion and finger movements, learned from large-scale motion data to reduce reliance on audio and improve naturalness. It combines a semi-implicit, decoupled diffusion denoiser with motion priors (trajectory and finger) and a JEPA-based semantic alignment module to produce expressive, semantically aligned gestures efficiently. The approach leverages a 9D global trajectory descriptor $G=(\Delta x,\Delta y,\Delta z,\mathrm{rot6d})$ and dedicated finger priors, trained on extensive Mocap and sign-language datasets, then refined via a motion-prior optimizer during inference. Experimental results on BEATX and related datasets show superior gesture realism, better beat alignment, and significantly faster generation (e.g., 0.88 seconds for 2-second gestures) compared to 1000-step DDPM baselines, with strong user study validation. This work advances real-time, physically grounded co-speech gesture synthesis for immersive human-computer interaction.
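To make the 9D global trajectory descriptor concrete, the following is a minimal sketch of how such a vector could be assembled per frame. It rests on assumptions that the summary does not confirm: the displacement is taken between consecutive frames, the first frame gets zero displacement, and the 6D rotation uses one common convention (the first two columns of the root rotation matrix). The function names `yaw_matrix`, `rot6d`, and `trajectory_descriptor` are hypothetical and are not taken from the HoloGest code.

```python
import math

def yaw_matrix(theta):
    """Return a 3x3 rotation about the vertical (y) axis as nested lists."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]

def rot6d(R):
    """6D rotation (assumed convention): first two columns of R, concatenated."""
    col0 = [R[i][0] for i in range(3)]
    col1 = [R[i][1] for i in range(3)]
    return col0 + col1

def trajectory_descriptor(positions, rotations):
    """Build one 9D vector (dx, dy, dz, rot6d) per frame.

    Assumption: displacement is measured from the previous frame,
    and the first frame is assigned zero displacement.
    """
    descriptors = []
    prev = positions[0]
    for pos, R in zip(positions, rotations):
        delta = [p - q for p, q in zip(pos, prev)]
        descriptors.append(delta + rot6d(R))
        prev = pos
    return descriptors

# Toy root trajectory: three frames drifting along x while turning slightly.
positions = [(0.0, 0.9, 0.0), (0.1, 0.9, 0.0), (0.2, 0.9, 0.05)]
rotations = [yaw_matrix(0.0), yaw_matrix(0.1), yaw_matrix(0.2)]
G = trajectory_descriptor(positions, rotations)
```

Each entry of `G` has nine components: three for the translation change and six for the rotation. Using displacements rather than absolute positions keeps the descriptor independent of where the sequence starts, which is one plausible reason to prefer this form for a motion prior.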

Abstract

Animating virtual characters with holistic co-speech gestures is a challenging but critical task. Previous systems have primarily focused on the weak correlation between audio and gestures, leading to physically unnatural outcomes that degrade the user experience. To address this problem, we introduce HoloGest, a novel neural network framework based on decoupled diffusion and motion priors for the automatic generation of high-quality, expressive co-speech gestures. Our system leverages large-scale human motion datasets to learn a robust prior with low audio dependency and high motion reliance, enabling stable global motion and detailed finger movements. To improve the generation efficiency of diffusion-based models, we integrate implicit joint constraints with explicit geometric and conditional constraints, capturing complex motion distributions between large strides. This integration significantly enhances generation speed while maintaining high-quality motion. Furthermore, we design a shared embedding space for gesture-transcription text alignment, enabling the generation of semantically correct gesture actions. Extensive experiments and user feedback demonstrate the effectiveness and potential applications of our model, with our method achieving a level of realism close to the ground truth and providing an immersive user experience. Our code, model, and demo are available at https://cyk990422.github.io/HoloGest.github.io/.

Paper Structure

This paper contains 15 sections, 8 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: A comparison of three methods: DSG, a diffusion-based co-speech gesture generation method using DDPM (stiff limbs, slow inference, physically unnatural); EMAGE, an autoregressive generation method using VAE (motion artifacts, global flipping, physically unnatural); and our proposed generation method (rich movements, lively fingers, physically natural). The transition from past frames to the current frame (every 10 frames) is represented by the gradient in virtual human color, from light to dark.
  • Figure 2: Our system comprises a semantic alignment module and two core components: (a) The semantic alignment module maps both the transcribed text and gesture sequence into the latent space simultaneously, further abstracting the semantic latent variables and aligning them with the gesture latent variables in a higher-level abstract space, serving as independent guiding tokens. (b) The semi-implicit decoupled denoiser, by introducing GAN and semi-implicit constraints, models the complex denoising distribution between adjacent large strides, accelerating generation by reducing the number of steps. (c) The motion prior optimization takes the denoised initial local gesture sequence as a condition, and in conjunction with the audio guiding signal, generates global motion and finger actions for the second time. This system requires no additional input and has no time constraints; any pure audio file can generate a set of vivid, natural, and high-quality holistic co-speech gesture sequences. 'r2l' represents converting the rotation representation to the coordinate representation using the SMPL model.
  • Figure 3: A comparison of three methods: DSG, a diffusion-based co-speech gesture generation method using DDPM (stiff limbs, slow inference, physically unnatural); EMAGE, an autoregressive generation method using VAE (motion artifacts, global flipping, physically unnatural); and our proposed generation method (rich movements, lively fingers, physically natural). We test on a sequence of an English-speaking presenter selected from BEATX. Red annotations indicate defects, while yellow annotations highlight advantages.