EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling

Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, Michael J. Black

TL;DR

EMAGE tackles the challenge of generating full-body, audio-synchronized co-speech gestures by introducing Masked Audio-Conditioned Gesture Modeling and a unified mesh-level dataset BEAT2 that combines refined SMPL-X body with FLAME head parameters. The framework uses two training pathways—Masked Gesture Reconstruction and Audio-Conditioned Gesture Generation—along with a Masked Audio Gesture Transformer and cross-attention to fuse audio with partially hidden gesture priors. Gestures are generated through four compositional VQ-VAEs (face, upper body, hands, lower body) plus a Global Motion Predictor for translations, with Content Rhythm Self-Attention adaptively blending rhythm and semantic content. BEAT2 enables holistic, high-fidelity motion with state-of-the-art results and supports training on non-holistic datasets, demonstrating improved realism, diversity, and audio synchronization for full-body gestures. The work contributes a large, standardized mesh-level dataset and a practical framework for unified co-speech gesture generation that can leverage partial inputs and multiple datasets.
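
To make the idea of "partially hidden gesture priors" concrete, the following is a minimal sketch of masking a gesture sequence before it is passed to a model. The array layout `(frames, joints, channels)`, the uniform random masking scheme, the zero-valued mask token, and the function name `mask_gestures` are all assumptions made for illustration; they are not taken from the EMAGE code, where the mask token would presumably be learned.

```python
import numpy as np


def mask_gestures(gestures, mask_ratio, mask_token, rng):
    """Hide a random subset of (frame, joint) entries.

    gestures:   (T, J, C) array: T frames, J joints, C channels per joint.
    mask_ratio: fraction of (frame, joint) entries to hide, in [0, 1].
    mask_token: (C,) vector written into every hidden entry.
    Returns (masked copy of gestures, boolean mask of shape (T, J)).
    """
    T, J, _ = gestures.shape
    # True marks an entry that is hidden from the model.
    hidden = rng.random((T, J)) < mask_ratio
    masked = gestures.copy()
    masked[hidden] = mask_token  # broadcasts the token over all hidden entries
    return masked, hidden


rng = np.random.default_rng(0)
T, J, C = 10, 5, 3  # illustrative sizes only
gestures = rng.normal(size=(T, J, C))
mask_token = np.zeros(C)  # assumption: a real model would learn this vector

masked, hidden = mask_gestures(gestures, 0.4, mask_token, rng)
print("hidden entries:", int(hidden.sum()), "of", T * J)
```

Setting `mask_ratio` to 1.0 hides every entry, which corresponds to generating gestures from audio alone, while smaller ratios leave some gestures visible as hints.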

Abstract

We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPL-X body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion-captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer, facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content and utilizes four compositional VQ-VAEs to enhance the results' fidelity and diversity. Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance and is flexible in accepting predefined spatial-temporal gesture inputs, generating complete, audio-synchronized results. Our code and dataset are available at https://pantomatrix.github.io/EMAGE/
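
As a rough illustration of the four compositional VQ-VAEs mentioned in the abstract, the sketch below shows only the quantization step: each body part's latent vectors are snapped to the nearest entry of that part's own codebook. The encoders and decoders are omitted, and the codebook size, latent dimension, random codebooks, and the names `quantize` and `PARTS` are assumptions made for illustration rather than details of the EMAGE implementation.

```python
import numpy as np

# Hypothetical part names following the four segments named in the paper.
PARTS = ("face", "upper_body", "hands", "lower_body")


def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (L2 distance).

    latents:  (T, D) array, one latent vector per frame.
    codebook: (K, D) array of code vectors.
    Returns (indices of shape (T,), quantized latents of shape (T, D)).
    """
    # Squared distance between every latent and every code: shape (T, K).
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]


rng = np.random.default_rng(0)
T, D, K = 8, 16, 32  # illustrative sizes, not the paper's settings

# One independent codebook per body part (random here; learned in practice).
codebooks = {part: rng.normal(size=(K, D)) for part in PARTS}
latents = {part: rng.normal(size=(T, D)) for part in PARTS}

tokens = {}
for part in PARTS:
    idx, _ = quantize(latents[part], codebooks[part])
    tokens[part] = idx
    print(part, idx.tolist())
```

Keeping a separate codebook per part means a token for the hands never has to share capacity with a token for the face, which is one plausible reading of why a compositional design could help fidelity and diversity.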

Paper Structure

This paper contains 38 sections, 14 equations, 8 figures, and 9 tables.

Figures (8)

  • Figure 1: EMAGE. We present a Masked Audio-Conditioned Gesture Modeling framework, along with a new holistic gesture dataset, BEAT2 (BEAT-SMPLX-FLAME), for jointly generating facial expressions, local body dynamics, hand movements, and global translations, conditioned on audio and partially or completely masked gestures. Gray denotes visible gestures, and blue represents our outputs.
  • Figure 2: Comparison of Data between BEAT2 and Others. BEAT-SMPLX-FLAME presents a new mesh-level, motion-captured, holistic co-speech gesture dataset with 60h of data. Left: We compare our refined SMPL-X body parameters (denoted as Refined MoSh) with the original BEAT skeleton [liu2022beat], Retargeted from AutoRegPro, and the initial results of MoSh++ [SMPL-X:2019]. The refined results show correct neck flexion, appropriate head and neck shape ratios, and detailed finger representations. Right: Visualization of blendshape weights from the original BEAT dataset [liu2022beat] with ARKit's template, Wrapped-based, and handcrafted optimization. Our final handcrafted FLAME blendshape-based optimization demonstrates both accurate lip movement details and natural mouth shapes.
  • Figure 3: EMAGE leverages two training paths: Masked Gesture Reconstruction (MG2G) and Audio-Conditioned Gesture Generation (A2G). The MG2G path focuses on encoding robust body hints through a spatial-temporal transformer gesture encoder and a cross-attention gesture decoder. In contrast, the A2G path uses these body hints and separate audio encoders to decode pretrained face and body latent features. A key component in this process is a switchable cross-attention layer, which merges body hints and audio features. This fusion allows the features to be effectively disentangled and used for gesture decoding. Once the gesture latent features are reconstructed, EMAGE employs a pretrained VQ-Decoder to decode face and local body motions. Additionally, a pretrained Global Motion Predictor estimates global body translations, further enhancing the model's capability to generate realistic and coherent gestures.
  • Figure 4: Details of CRA and Pretrained VQ-VAEs. Left: Content Rhythm Attention adaptively fuses speech rhythm (onset and amplitude) with content (pretrained word embeddings from text scripts). This results in a preference for either content or rhythm in specific frames, which encourages the generation of semantic-aware gestures. Right: We pretrain four compositional VQ-VAEs by reconstructing the face, upper body, hands, and lower body separately to explicitly disentangle audio-agnostic gestures.
  • Figure 5: Comparison of Forward Path Designs. Straightforward fusion module (a) merges audio features without refined body features and recombines audio features based only on position embedding. The Self-Attention decoder module (b), adopted in previous MLM models [devlin2018bert, lan2019albert], is limited for tasks requiring auto-regressive inference. Our design (c) considers effective audio feature fusion and auto-regressive decoding.
  • ...and 3 more figures
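
The adaptive rhythm and content fusion described in the Figure 4 caption can be sketched as a per-frame convex blend of two feature streams. This is a simplified stand-in, not the paper's exact formulation: the linear gate, its random weights, the feature sizes, and the name `blend_rhythm_content` are assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def blend_rhythm_content(rhythm, content, gate_weights):
    """Blend rhythm and content features with per-frame weights.

    rhythm, content: (T, D) feature arrays, one row per frame.
    gate_weights:    (2 * D, 2) matrix of a hypothetical linear gate.
    Returns (fused features of shape (T, D), blend weights of shape (T, 2)).
    """
    gate_input = np.concatenate([rhythm, content], axis=-1)  # (T, 2D)
    alpha = softmax(gate_input @ gate_weights, axis=-1)      # (T, 2)
    # Column 0 weights rhythm, column 1 weights content; each row sums to 1.
    fused = alpha[:, :1] * rhythm + alpha[:, 1:] * content
    return fused, alpha


rng = np.random.default_rng(0)
T, D = 6, 4  # illustrative sizes only
rhythm = rng.normal(size=(T, D))
content = rng.normal(size=(T, D))
gate_weights = rng.normal(size=(2 * D, 2))  # random here; learned in practice

fused, alpha = blend_rhythm_content(rhythm, content, gate_weights)
for t in range(T):
    print(f"frame {t}: rhythm {alpha[t, 0]:.2f}, content {alpha[t, 1]:.2f}")
```

Because the weights are computed per frame, some frames can lean almost entirely on rhythm and others on content, which matches the caption's description of a frame-specific preference.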