MAG: Multi-Modal Aligned Autoregressive Co-Speech Gesture Generation without Vector Quantization

Binjie Liu, Lina Liu, Sanyi Zhang, Songen Gu, Yihao Zhi, Tianyi Zhu, Lei Yang, Long Ye

TL;DR

This paper addresses the challenge of realistic, diverse, and semantically aligned co-speech gesture generation without information loss from vector quantization. It introduces a two-stage framework: MTA-VAE learns continuous, multi-modal motion embeddings aligned to text and audio using contrastive losses with WavCaps embeddings, and MMAG performs diffusion-based autoregressive generation conditioned on a Hybrid Granularity Audio-Text Fusion Block, plus speaker identity encoding. Key contributions include the Motion-Text-Audio-Aligned VAE, the HGAT fusion block, and the diffusion-based MMAG that operates in continuous latent space, achieving state-of-the-art results on BEATv2 and SHOW with favorable user studies. The approach has practical impact for lifelike avatar synthesis and HCI by producing synchronized, realistic, and diverse co-speech gestures without the downsides of token-based discretization.

Abstract

This work focuses on full-body co-speech gesture generation. Existing methods typically employ an autoregressive model accompanied by vector-quantized tokens for gesture generation, which results in information loss and compromises the realism of the generated gestures. To address this, inspired by the natural continuity of real-world human motion, we propose MAG, a novel multi-modal aligned framework for high-quality and diverse co-speech gesture synthesis without relying on discrete tokenization. Specifically, (1) we introduce a motion-text-audio-aligned variational autoencoder (MTA-VAE), which leverages pre-trained WavCaps' text and audio embeddings to enhance both semantic and rhythmic alignment with motion, ultimately producing more realistic gestures. (2) Building on this, we propose a multimodal masked autoregressive model (MMAG) that enables autoregressive modeling in continuous motion embeddings through diffusion without vector quantization. To further ensure multi-modal consistency, MMAG incorporates a hybrid granularity audio-text fusion block, which serves as conditioning for the diffusion process. Extensive experiments on two benchmark datasets demonstrate that MAG achieves state-of-the-art performance both quantitatively and qualitatively, producing highly realistic and diverse co-speech gestures. The code will be released to facilitate future research.
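
To make the alignment idea in stage (1) concrete, the following is a minimal sketch of a contrastive objective that pulls each motion embedding toward its paired text and audio embeddings. The abstract does not specify the exact loss, so the symmetric InfoNCE form, the temperature value, the loss weights, and all function names here are assumptions for illustration only, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss for two batches of paired embeddings, shape (B, D).

    Row i of `a` is treated as the positive match for row i of `b`;
    every other row in the batch acts as a negative.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # (B, B) similarity matrix

    def diagonal_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

def alignment_loss(e_m, z_t, z_a, w_text=1.0, w_audio=1.0):
    # Hypothetical combination: motion-text term plus motion-audio term.
    return w_text * info_nce(e_m, z_t) + w_audio * info_nce(e_m, z_a)
```

In this sketch `e_m`, `z_t`, and `z_a` are assumed to be already projected to a shared dimension; in a full model this term would be added to the usual VAE reconstruction and KL terms.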

Paper Structure

This paper contains 12 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: On the left, inspired by the natural continuity of human motion, we observe that VQ-VAE discretizes motion data, while VAE preserves a continuous latent space that better reflects real motion. On the right, motivated by this, we propose MAG, a framework that enables autoregressive modeling in continuous motion embeddings through diffusion, eliminating vector quantization. Given speech audio and text transcripts as conditionings, our model generates motion embeddings, which are decoded into realistic gestures via a Motion VAE decoder.
  • Figure 2: Architecture of MAG. MAG generates realistic co-speech gestures in two stages: (a) MTA-VAE: Motion VAE encodes motion embeddings $\mathbf{e}_m$, which are aligned with WavCaps' text embeddings $\mathbf{z}_t$ and audio embeddings $\mathbf{z}_a$ through contrastive learning. (b) MMAG: MMAG utilizes an autoregressive model to predict each motion embedding, derived from the MTA-VAE encoder. The diffusion process, which predicts noise and ultimately generates the motion embedding, is guided by a text-audio fusion mechanism within a hybrid granularity audio-text fusion block. This ensures coherence across modalities, with the final motion output being produced by the MTA-VAE decoder.
  • Figure 3: Inference. By feeding the noise and conditioning $\mathbf{c}$ from the MMAG input into the denoising network, we can accurately generate the motion embeddings, which are then reconstructed into the real motion by the motion VAE decoder.
  • Figure 4: Architecture of HGAT. The Hybrid Granularity Audio-Text Fusion Block processes audio and text inputs, extracting low-level audio features (MFCC) for rhythm synchronization and high-level features (HuBERT) for semantic understanding, while text features are extracted using fastText. These features are then fused through attention mechanisms.
  • Figure 5: Qualitative Comparison on BEATv2 Dataset. Compared to other methods, our MAG approach generates gestures that more closely resemble the ground truth and achieve better synchronization with both audio and text input. The gestures respond to high-frequency segments in the audio and show semantically relevant movements for meaningful words such as "well," "become," "for me," and "feel shy."
  • ...and 1 more figure
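
The inference path described in the Figure 3 caption (noise plus conditioning $\mathbf{c}$ fed into a denoising network to produce a motion embedding) can be sketched roughly as below. This is a generic DDPM-style ancestral sampling loop, which is an assumption: the captions do not state the sampler, noise schedule, or step count. The toy denoiser is a hypothetical stand-in for the trained network, and the final decoding of the embedding by the VAE decoder is not modeled.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    # Linear beta schedule; all values here are illustrative assumptions.
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def sample_motion_embedding(denoiser, c, dim, schedule, rng):
    """Generate one motion embedding from noise, guided by conditioning c."""
    betas, alphas, alpha_bars = schedule
    x = rng.standard_normal(dim)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps_hat = denoiser(x, t, c)  # predicted noise at step t
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No fresh noise is injected on the final step.
        noise = rng.standard_normal(dim) if t > 0 else np.zeros(dim)
        x = mean + np.sqrt(betas[t]) * noise
    return x

def make_toy_denoiser(target_fn, schedule):
    # Hypothetical oracle: pretends the clean embedding is target_fn(c) and
    # returns the noise implied by that guess. A trained network goes here.
    _, _, alpha_bars = schedule

    def denoiser(x_t, t, c):
        x0 = target_fn(c)
        return (x_t - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])

    return denoiser
```

With the oracle denoiser the loop recovers `target_fn(c)` up to floating-point error, which makes it a convenient check that the update rule is wired correctly before substituting a learned model.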