Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers

Sohan Anisetty, James Hays

TL;DR

This work tackles multimodal whole-body motion generation conditioned on text and audio. It combines three dedicated VQ-VAE modules to discretize body and hand motions with a bidirectional Masked Language Modeling framework and a token critic, enabling parallel token prediction and significantly faster sampling. A global motion predictor decouples root translation from local motion, while cross-attention and FiLM layers provide flexible conditioning for text and audio. Empirical results on standard motion datasets show improved motion-text alignment, realism, and the ability to generate long-form sequences and perform motion editing, with strong potential for animation, VR, and HCI applications.
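As a rough illustration of the FiLM conditioning mentioned above, the sketch below applies feature-wise linear modulation: a conditioning vector is projected to a per-channel scale (gamma) and shift (beta), which are then applied to every frame of a feature sequence. The function names, the shapes, and the use of a single linear projection are assumptions made for illustration; the summary does not specify how the paper's FiLM layers are built or which modality drives them.

```python
from typing import List, Sequence


def linear(x: Sequence[float],
           weight: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Plain affine map: one output per row of `weight`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def film(features: Sequence[Sequence[float]],
         cond: Sequence[float],
         w_gamma: Sequence[Sequence[float]],
         b_gamma: Sequence[float],
         w_beta: Sequence[Sequence[float]],
         b_beta: Sequence[float]) -> List[List[float]]:
    """Feature-wise linear modulation (hypothetical helper).

    features: T frames, each with C channels.
    cond:     conditioning vector (e.g. an audio or text embedding).
    Returns gamma * x + beta per channel, with gamma and beta
    predicted from `cond` and shared across all frames.
    """
    gamma = linear(cond, w_gamma, b_gamma)  # per-channel scale
    beta = linear(cond, w_beta, b_beta)     # per-channel shift
    return [[g * x + b for x, g, b in zip(frame, gamma, beta)]
            for frame in features]
```

Because the scale and shift depend only on the condition, the same modulation is applied at every time step, which keeps this kind of conditioning cheap compared with cross-attention.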

Abstract

Our research presents a novel motion generation framework designed to produce whole-body motion sequences conditioned on multiple modalities simultaneously, specifically text and audio inputs. Leveraging Vector Quantized Variational Autoencoders (VQ-VAEs) for motion discretization and a bidirectional Masked Language Modeling (MLM) strategy for efficient token prediction, our approach achieves improved processing efficiency and coherence in the generated motions. By integrating spatial attention mechanisms and a token critic, we ensure consistency and naturalness in the generated motions. This framework expands the possibilities of motion generation, addressing the limitations of existing approaches and opening avenues for multimodal motion synthesis.

Paper Structure (23 sections, 5 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Our three-stage motion generation pipeline. (a) The whole-body motion sequence, excluding translation, is tokenized into three distinct token sequences by VQ-VAEs dedicated to the body, left hand, and right hand. (b) During training, a random subset of the input tokens is masked and the model predicts the missing tokens; a token critic is trained to distinguish ground-truth tokens from predicted ones. During inference, all motion indices in a sequence are predicted simultaneously, and the token critic guides which indices to retain and which to remask and resample. The indices are then mapped to the corresponding local motion by the VQ-VAE decoder. (c) A global motion predictor is trained to map body joint positions and velocities to root translation. During inference, it derives root translation from the predicted local motion.
  • Figure 2: Visual results for audio and text conditions. From top to bottom: a dance generated from break dance music; a dance generated from break dance music together with the text "a person doing ballet"; motion generated from the text "a person sneaks away while walking sideways". Only key frames are shown.
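The inference loop described in Figure 1(b) can be sketched as follows. This is a minimal illustration, not the authors' code: `predict` and `critic` are hypothetical callables standing in for the trained masked transformer and token critic, `MASK` is an assumed mask-token id, and the cosine remasking schedule is an assumption borrowed from common masked-token samplers.

```python
import math
from typing import Callable, List, Sequence

MASK = -1  # hypothetical id reserved for the mask token


def critic_guided_decode(
    length: int,
    predict: Callable[[Sequence[int]], Sequence[int]],
    critic: Callable[[Sequence[int]], Sequence[float]],
    steps: int = 4,
) -> List[int]:
    """Iterative parallel decoding with critic-guided remasking.

    predict(tokens) proposes a token for every position.
    critic(tokens)  returns one score per position; higher means the
                    token looks more like ground truth.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        # Fill every masked slot in parallel; retained tokens stay fixed.
        proposal = predict(tokens)
        tokens = [p if t == MASK else t for t, p in zip(tokens, proposal)]
        if step == steps:
            break
        # Assumed cosine schedule: remask fewer tokens as decoding proceeds.
        n_mask = int(length * math.cos(math.pi / 2 * step / steps))
        scores = critic(tokens)
        worst = sorted(range(length), key=lambda i: scores[i])[:n_mask]
        for i in worst:
            tokens[i] = MASK
    return tokens
```

Because the critic rescores every position at each step, a token accepted early can still be remasked and resampled later, which is the main difference from confidence-based schemes that freeze tokens once they are kept.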