Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers
Sohan Anisetty, James Hays
TL;DR
This work tackles multimodal whole-body motion generation conditioned on text and audio. It combines three dedicated VQ-VAE modules to discretize body and hand motions with a bidirectional Masked Language Modeling framework and a token critic, enabling parallel token prediction and significantly faster sampling. A global motion predictor decouples root translation from local motion, while cross-attention and FiLM layers provide flexible conditioning for text and audio. Empirical results on standard motion datasets show improved motion-text alignment, realism, and the ability to generate long-form sequences and perform motion editing, with strong potential for animation, VR, and HCI applications.
Abstract
We present a novel motion generation framework that produces whole-body motion sequences conditioned on multiple modalities simultaneously, specifically text and audio. It uses Vector Quantized Variational Autoencoders (VQ-VAEs) to discretize motion and a bidirectional Masked Language Modeling (MLM) strategy to predict tokens in parallel, improving both processing efficiency and the coherence of the generated motions. Spatial attention mechanisms and a token critic promote consistency and naturalness in the generated motions. The framework addresses limitations of existing approaches and opens avenues for multimodal motion synthesis.
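
To make the parallel masked-token sampling idea concrete, the following is a minimal sketch of iterative masked decoding in which a critic decides which tokens to re-mask. It is an illustration under stated assumptions, not the authors' implementation: `toy_predictor` and `toy_critic` are hypothetical random stand-ins for the bidirectional transformer and the token critic, and the cosine re-masking schedule, the `MASK` id, the codebook size, and the step count are all assumed for the example.

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token (assumption)


def toy_predictor(tokens, codebook_size, rng):
    """Stand-in for the bidirectional transformer (assumption):
    proposes one codebook index for every position at once."""
    return [rng.randrange(codebook_size) for _ in tokens]


def toy_critic(tokens, rng):
    """Stand-in for the token critic (assumption):
    returns one plausibility score per position."""
    return [rng.random() for _ in tokens]


def masked_decode(seq_len, codebook_size, steps=4, seed=0):
    """Start fully masked, fill all masked slots in parallel each step,
    then re-mask the positions the critic scores lowest."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, steps + 1):
        proposal = toy_predictor(tokens, codebook_size, rng)
        # Fill every masked position in a single parallel pass.
        tokens = [p if t == MASK else t for t, p in zip(tokens, proposal)]
        # Assumed cosine schedule: fewer tokens are re-masked each step,
        # reaching zero at the final step.
        n_mask = int(seq_len * math.cos(math.pi / 2 * step / steps))
        if n_mask == 0:
            break
        scores = toy_critic(tokens, rng)
        lowest = sorted(range(seq_len), key=scores.__getitem__)[:n_mask]
        for i in lowest:
            tokens[i] = MASK
    return tokens


if __name__ == "__main__":
    print(masked_decode(seq_len=16, codebook_size=512))
```

The point of the sketch is the cost profile: the number of model calls is bounded by `steps` rather than by the sequence length, which is what distinguishes parallel masked decoding from token-by-token autoregressive sampling.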
