Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers

Sohan Anisetty; James Hays

TL;DR

This work tackles multimodal whole-body motion generation conditioned on text and audio. Three dedicated VQ-VAE modules discretize body, left-hand, and right-hand motion into tokens. A bidirectional Masked Language Modeling (MLM) framework, paired with a token critic, then predicts those tokens in parallel, which makes sampling significantly faster. A global motion predictor decouples root translation from local motion, while cross-attention and FiLM layers provide flexible conditioning on text and audio. Empirical results on standard motion datasets show improved motion-text alignment and realism, along with the ability to generate long-form sequences and perform motion editing, with potential applications in animation, VR, and HCI.
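To make the FiLM conditioning mentioned above concrete, here is a minimal sketch of feature-wise linear modulation in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function name `film` and the choice to pass `gamma` and `beta` in directly are hypothetical. In a trained model the scale and shift would be predicted from the conditioning signal by a small network.

```python
from typing import List, Sequence


def film(features: Sequence[Sequence[float]],
         gamma: Sequence[float],
         beta: Sequence[float]) -> List[List[float]]:
    """Apply feature-wise linear modulation to a (frames x channels) sequence.

    Channel c of every frame is scaled by gamma[c] and shifted by beta[c].
    Assumption: gamma and beta are given; a real model would predict them
    from the conditioning input.
    """
    if len(gamma) != len(beta):
        raise ValueError("gamma and beta must have the same length")
    modulated = []
    for frame in features:
        if len(frame) != len(gamma):
            raise ValueError("frame width must match gamma and beta")
        # Per-channel affine transform: gamma * x + beta.
        modulated.append([g * x + b for x, g, b in zip(frame, gamma, beta)])
    return modulated


if __name__ == "__main__":
    motion_features = [[1.0, 2.0], [3.0, 4.0]]  # two frames, two channels
    print(film(motion_features, gamma=[2.0, 0.5], beta=[0.0, 1.0]))
```

Because the modulation is applied per channel and shared across frames, a single conditioning vector can steer an entire sequence without changing its length.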

Abstract

Our research presents a novel motion generation framework designed to produce whole-body motion sequences conditioned on multiple modalities simultaneously, specifically text and audio inputs. Leveraging Vector Quantized Variational Autoencoders (VQVAEs) for motion discretization and a bidirectional Masked Language Modeling (MLM) strategy for efficient token prediction, our approach achieves improved processing efficiency and coherence in the generated motions. By integrating spatial attention mechanisms and a token critic, we ensure that the output remains consistent and natural. This framework expands the possibilities of motion generation, addressing the limitations of existing approaches and opening avenues for multimodal motion synthesis.
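The motion discretization step can be illustrated with the nearest-neighbour codebook lookup at the core of vector quantization. The sketch below is a simplified stand-in, not the authors' code: it omits the encoder, decoder, and training losses, and the names `quantize` and `dequantize` as well as the toy codebook are assumptions made for illustration.

```python
from typing import List, Sequence


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    indices = []
    for z in latents:
        # Squared Euclidean distance from z to every code vector.
        dists = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
        indices.append(min(range(len(dists)), key=dists.__getitem__))
    return indices


def dequantize(indices: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the code vector for each token index."""
    return [list(codebook[i]) for i in indices]


if __name__ == "__main__":
    toy_codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]  # hypothetical codes
    toy_latents = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.9]]  # hypothetical encoder outputs
    tokens = quantize(toy_latents, toy_codebook)
    print(tokens)
    print(dequantize(tokens, toy_codebook))
```

The resulting integer indices are the discrete tokens that a masked language model can then predict.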

Paper Structure

This paper contains 23 sections, 5 equations, 2 figures, and 2 tables.

Introduction
Related Work
Vector Quantization
Motion Synthesis
Text conditioned motion generation
Music conditioned dance generation
Co-speech gesture generation
Masked modelling for generation
System overview
Pose Representation:
Conditioning Representation:
VQVAE
Training objective:
Global Motion Predictor
Local Motion Generator
...and 8 more sections

Figures (2)

Figure 1: Our three-stage motion generation pipeline. (a) The whole-body motion sequence, excluding translation, is first tokenized into three distinct motion sequences using VQ-VAEs dedicated to the body, left hand, and right hand. (b) During training, a random subset of the input tokens is masked and the model is tasked with predicting the missing tokens. A token critic is trained to distinguish ground-truth tokens from predicted ones. During inference, all motion indices in a sequence are predicted simultaneously, and the token critic guides the decision on which indices to retain and which to remask and resample. These indices are then mapped to the corresponding local motion using the VQ-VAE decoder. (c) A global motion predictor is trained to map body joint positions and velocities to root translation. During inference, this predictor derives root translation from the predicted local motion.
Figure 2: Visual results for audio and text conditioning. From top to bottom: a dance generated from break dance music; a dance generated from break dance music together with the text "a person doing ballet"; motion generated from the text "a person sneaks away while walking sideways". Only key frames are shown.
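The inference procedure described in Figure 1(b) can be sketched as an iterative loop: predict every masked index in parallel, score the result with the critic, then remask the lowest-scoring positions and resample them. The code below is a minimal sketch under several assumptions that do not come from the paper: the `MASK` id, the cosine remasking schedule, the number of steps, and the `predict` and `critic` callables (stand-ins for the trained generator and token critic) are all hypothetical.

```python
import math
from typing import Callable, List, Sequence

MASK = -1  # hypothetical id marking a masked position


def critic_guided_decode(length: int,
                         predict: Callable[[List[int]], Sequence[int]],
                         critic: Callable[[List[int]], Sequence[float]],
                         steps: int = 4) -> List[int]:
    """Iteratively fill a fully masked sequence of token indices.

    predict(tokens) returns one proposed index per position.
    critic(tokens) returns one score per position; higher means more plausible.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    tokens = [MASK] * length
    for step in range(steps):
        # Propose all positions at once; only masked slots are overwritten.
        proposal = predict(tokens)
        tokens = [p if t == MASK else t for t, p in zip(tokens, proposal)]
        if step == steps - 1:
            break  # final pass: keep everything
        # Assumed cosine schedule: the remasked count shrinks every step.
        n_mask = int(length * math.cos(math.pi / 2 * (step + 1) / steps))
        scores = critic(tokens)
        # Remask the positions the critic trusts least.
        for i in sorted(range(length), key=lambda j: scores[j])[:n_mask]:
            tokens[i] = MASK
    return tokens


if __name__ == "__main__":
    target = [3, 1, 4, 1, 5, 9, 2, 6]  # toy "ground truth" indices
    toy_predict = lambda toks: list(target)
    toy_critic = lambda toks: [1.0 if t == g else 0.0 for t, g in zip(toks, target)]
    print(critic_guided_decode(len(target), toy_predict, toy_critic))
```

Since each pass fills all masked positions at once, the number of model calls is set by `steps` rather than by the sequence length, which is where the speed-up over token-by-token decoding comes from.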
