Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers

Yasheng Sun, Zhiliang Xu, Hang Zhou, Jiazhi Guan, Quanwei Yang, Kaisiyuan Wang, Borong Liang, Yingying Li, Haocheng Feng, Jingdong Wang, Ziwei Liu, Koike Hideki

TL;DR

Cosh-DiT tackles the challenging problem of synthesizing co-speech gesture videos that are synchronized with speech while maintaining photorealistic appearance. It introduces a two-stage diffusion framework: a discrete audio-driven gesture diffusion transformer (Cosh-DiT-A) that converts speech into a hybrid gesture representation, and a continuous video diffusion transformer (Cosh-DiT-V) that renders lifelike video conditioned on the generated motion. The system relies on a VQ-VAE-based discrete latent space to model upper-body poses and 3D hand meshes, along with a Geometric-Aware Alignment module to ensure accurate hand and wrist projection, and uses stacked iterative DiT blocks to fuse appearance, motion history, and gesture guidance. Quantitative and qualitative results show that Cosh-DiT achieves superior image quality, temporal coherence, and hand/facial details compared with state-of-the-art baselines, demonstrating its potential for realistic co-speech avatar animation and related applications.

Abstract

Co-speech gesture video synthesis is a challenging task that requires both probabilistic modeling of human gestures and the synthesis of realistic images that align with the rhythmic nuances of speech. To address these challenges, we propose Cosh-DiT, a Co-speech gesture video system with hybrid Diffusion Transformers that perform audio-to-motion and motion-to-video synthesis using discrete and continuous diffusion modeling, respectively. First, we introduce an audio Diffusion Transformer (Cosh-DiT-A) to synthesize expressive gesture dynamics synchronized with speech rhythms. To capture upper body, facial, and hand movement priors, we employ vector-quantized variational autoencoders (VQ-VAEs) to jointly learn their dependencies within a discrete latent space. Then, for realistic video synthesis conditioned on the generated speech-driven motion, we design a visual Diffusion Transformer (Cosh-DiT-V) that effectively integrates spatial and temporal contexts. Extensive experiments demonstrate that our framework consistently generates lifelike videos with expressive facial expressions and natural, smooth gestures that align seamlessly with speech.
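
The discrete latent space mentioned above rests on the standard VQ-VAE quantization step, in which each continuous latent vector is replaced by the index of its nearest codebook entry. The following is a minimal sketch of that step only, not the authors' implementation: the function names `quantize` and `encode_sequence`, the toy two-dimensional codebook, and the use of squared Euclidean distance are all illustrative assumptions.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector`.

    Distance is squared Euclidean, the usual choice in VQ-VAE; the
    codebook here is a plain list of lists (an illustrative assumption).
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


def encode_sequence(latents, codebook):
    """Map a sequence of continuous latents to discrete token indices."""
    return [quantize(z, codebook)[0] for z in latents]


# Hypothetical toy data: three codes, three per-frame gesture latents.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
toy_latents = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]
tokens = encode_sequence(toy_latents, toy_codebook)
```

Token sequences of this kind are what a discrete diffusion model such as Cosh-DiT-A would operate on; how the paper trains the encoder and codebook is not specified in this summary.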

Paper Structure

This paper contains 34 sections, 10 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Reconstruction with Varied Representations. Holistic 3D reconstruction via SMPL-X often exhibits inaccuracies in limb and joint positioning (red box). While 2D pose estimation (e.g., DwPose) provides accurate body joint localization, it struggles to represent complex hand gestures effectively (yellow box). Combining the two yields a precise overlay.
  • Figure 2: Illustration of Cosh-DiT System. Given a human speech input and an arbitrary reference image, our framework consistently generates lifelike videos with synchronized gestures. The synthesized videos feature natural, rhythmic movements and expressive facial and hand gestures, capturing vivid details for a realistic portrayal.
  • Figure 3: Overview of Cosh-DiT System. On the left is the Cosh-DiT-A model, which takes audio signals as input and processes discrete gesture representations. The DiT model takes masked noisy tokens as input and predicts the quantized gesture tokens. On the right is the Cosh-DiT-V model for co-speech gesture video synthesis. It takes rendered gesture representations, appearance references, and previous motion frames as input to produce video frames.
  • Figure 4: Geometric-Aware Alignment. This module ensures that the key points of the hand model $[X_{k}, Y_{k}, Z_{k}]^\top$ in camera space are accurately projected to the 2D points $[u_k, v_k]^\top$ on the image plane by optimizing the translation vector $\mathbf{xyz} = [X_{tran}, Y_{tran}, Z_{tran}]^\top$.
  • Figure 5: Appearance Motion (AM) - DiT Architecture. Three types of information are simultaneously fed into the AM-DiT block. The noisy latents, conditioned on gesture guidance, are combined with the reference person identity information and previous motion features, which are iteratively integrated through a joint attention operation.
  • ...and 5 more figures
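
The Geometric-Aware Alignment idea in Figure 4 can be illustrated with a small sketch that fits the translation $[X_{tran}, Y_{tran}, Z_{tran}]^\top$ so that translated hand keypoints project onto their 2D targets. This is not the paper's optimizer: it assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and it swaps in a closed-form linear least-squares solve obtained by multiplying the projection equations through by the depth. The function names are illustrative.

```python
def project(point, t, fx, fy, cx, cy):
    """Pinhole projection of a camera-space point shifted by translation t."""
    X, Y, Z = (p + d for p, d in zip(point, t))
    return (fx * X / Z + cx, fy * Y / Z + cy)


def solve_3x3(M, r):
    """Solve M x = r (3x3) by Gaussian elimination with partial pivoting."""
    A = [row[:] + [val] for row, val in zip(M, r)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda i: abs(A[i][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for row in range(col + 1, 3):
            factor = A[row][col] / A[col][col]
            for k in range(col, 4):
                A[row][k] -= factor * A[col][k]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = A[i][3] - sum(A[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = s / A[i][i]
    return x


def fit_translation(points_3d, points_2d, fx, fy, cx, cy):
    """Least-squares translation aligning projected 3D keypoints to 2D ones.

    From u = fx * (X + tx) / (Z + tz) + cx, multiplying by (Z + tz) gives
    fx * tx - (u - cx) * tz = (u - cx) * Z - fx * X, which is linear in t
    (and likewise for v). Needs at least two keypoints with distinct 2D targets.
    """
    rows, rhs = [], []
    for (X, Y, Z), (u, v) in zip(points_3d, points_2d):
        du, dv = u - cx, v - cy
        rows.append([fx, 0.0, -du])
        rhs.append(du * Z - fx * X)
        rows.append([0.0, fy, -dv])
        rhs.append(dv * Z - fy * Y)
    # Normal equations: (A^T A) t = A^T b
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(3)]
    return solve_3x3(AtA, Atb)


# Hypothetical data: four hand keypoints and a known translation to recover.
hand_points = [(0.0, 0.0, 0.0), (0.05, 0.02, 0.01),
               (-0.03, 0.06, -0.02), (0.02, -0.04, 0.03)]
true_t = (0.1, -0.05, 0.6)
targets = [project(p, true_t, 500.0, 500.0, 256.0, 256.0) for p in hand_points]
estimated_t = fit_translation(hand_points, targets, 500.0, 500.0, 256.0, 256.0)
```

With noise-free targets the linear system is consistent, so the estimate matches the true translation up to floating-point error. With noisy 2D detections this algebraic fit minimizes a depth-weighted error rather than the pixel reprojection error, so it would typically serve as an initialization for an iterative refinement.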