MAG: Multi-Modal Aligned Autoregressive Co-Speech Gesture Generation without Vector Quantization
Binjie Liu, Lina Liu, Sanyi Zhang, Songen Gu, Yihao Zhi, Tianyi Zhu, Lei Yang, Long Ye
TL;DR
This paper addresses the challenge of realistic, diverse, and semantically aligned co-speech gesture generation without information loss from vector quantization. It introduces a two-stage framework: MTA-VAE learns continuous, multi-modal motion embeddings aligned to text and audio using contrastive losses with WavCaps embeddings, and MMAG performs diffusion-based autoregressive generation conditioned on a Hybrid Granularity Audio-Text Fusion Block, plus speaker identity encoding. Key contributions include the Motion-Text-Audio-Aligned VAE, the HGAT fusion block, and the diffusion-based MMAG that operates in continuous latent space, achieving state-of-the-art results on BEATv2 and SHOW with favorable user studies. The approach has practical impact for lifelike avatar synthesis and HCI by producing synchronized, realistic, and diverse co-speech gestures without the downsides of token-based discretization.
Abstract
This work focuses on full-body co-speech gesture generation. Existing methods typically employ an autoregressive model accompanied by vector-quantized tokens for gesture generation, which results in information loss and compromises the realism of the generated gestures. To address this, inspired by the natural continuity of real-world human motion, we propose MAG, a novel multi-modal aligned framework for high-quality and diverse co-speech gesture synthesis without relying on discrete tokenization. Specifically, (1) we introduce a motion-text-audio-aligned variational autoencoder (MTA-VAE), which leverages pre-trained WavCaps text and audio embeddings to enhance both semantic and rhythmic alignment with motion, ultimately producing more realistic gestures. (2) Building on this, we propose a multimodal masked autoregressive model (MMAG) that enables autoregressive modeling over continuous motion embeddings through diffusion, without vector quantization. To further ensure multi-modal consistency, MMAG incorporates a hybrid granularity audio-text fusion block, which serves as conditioning for the diffusion process. Extensive experiments on two benchmark datasets demonstrate that MAG achieves state-of-the-art performance both quantitatively and qualitatively, producing highly realistic and diverse co-speech gestures. The code will be released to facilitate future research.
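To make the alignment idea in (1) concrete, the following is a minimal sketch of a contrastive objective that pulls pooled motion latents toward paired text and audio embeddings. It is an illustration under stated assumptions, not the authors' implementation: the symmetric InfoNCE form, the temperature of 0.07, the per-clip pooling, and the function names `info_nce` and `alignment_loss` are all hypothetical, and random arrays stand in for the MTA-VAE latents and the WavCaps embeddings.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(motion, other, temperature=0.07):
    # Symmetric InfoNCE: row i of `motion` is the positive for row i of `other`;
    # every other row in the batch acts as a negative.
    # (Assumed loss form; the source only states that contrastive losses are used.)
    m = l2_normalize(motion)
    o = l2_normalize(other)
    logits = m @ o.T / temperature
    idx = np.arange(len(m))
    loss_m2o = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_o2m = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_m2o + loss_o2m)

def alignment_loss(motion, text, audio, temperature=0.07):
    # Align motion with both modalities; equal weighting is an assumption.
    return info_nce(motion, text, temperature) + info_nce(motion, audio, temperature)

# Stand-in data: 8 clips with 16-dimensional pooled embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
motion = rng.normal(size=(8, 16))
text = motion + 0.01 * rng.normal(size=(8, 16))
audio = motion + 0.01 * rng.normal(size=(8, 16))

aligned = alignment_loss(motion, text, audio)
# Reversing the batch order breaks every motion/text and motion/audio pairing.
mismatched = alignment_loss(motion, text[::-1], audio[::-1])
print(f"aligned: {aligned:.4f}  mismatched: {mismatched:.4f}")
```

In this toy setup the loss is lower when each motion embedding is paired with its own text and audio embeddings than when the pairing is broken, which is the behavior a contrastive alignment term is meant to reward.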
