BAMM: Bidirectional Autoregressive Motion Model

Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Pu Wang, Minwoo Lee, Srijan Das, Chen Chen

TL;DR

BAMM tackles text-to-motion generation by unifying denoising-style (generative masked) and autoregressive approaches through a VQ-VAE–based motion tokenizer and a conditional masked self-attention Transformer. Inference proceeds via cascaded decoding: an initial unidirectional pass predicts the motion length and coarse content, followed by a bidirectional pass that refines low-confidence tokens and enables editing, using classifier-free guidance. On HumanML3D and KIT-ML, BAMM reports state-of-the-art quality with strong text alignment and built-in editability, while predicting the motion length itself rather than requiring it as input and supporting long-sequence generation. The result is an end-to-end framework that improves usability and flexibility for applications in animation, gaming, and immersive media.

Abstract

Generating human motion from text has been dominated by denoising motion models, through either a diffusion or a generative masking process. However, these models face great limitations in usability because they require prior knowledge of the motion length. Conversely, autoregressive motion models address this limitation by adaptively predicting motion endpoints, at the cost of degraded generation quality and editing capabilities. To address these challenges, we propose the Bidirectional Autoregressive Motion Model (BAMM), a novel text-to-motion generation framework. BAMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into discrete tokens in latent space, and (2) a masked self-attention transformer that autoregressively predicts randomly masked tokens via a hybrid attention masking strategy. By unifying generative masked modeling and autoregressive modeling, BAMM captures rich, bidirectional dependencies among motion tokens while learning the probabilistic mapping from textual inputs to motion outputs with a dynamically adjusted motion sequence length. This feature enables BAMM to simultaneously achieve high-quality motion generation, enhanced usability, and built-in motion editability. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that BAMM surpasses current state-of-the-art methods in both qualitative and quantitative measures. Our project page is available at https://exitudio.github.io/BAMM-page
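
As a rough illustration of the hybrid attention masking idea mentioned in the abstract, the sketch below builds two boolean attention masks in plain Python. It is a simplified assumption, not the authors' implementation: the function names `causal_mask` and `bidirectional_mask` are hypothetical, and the rule used for the bidirectional case (attend to the past, plus any future position that is still unmasked) is one plausible reading of the abstract.

```python
def causal_mask(n):
    """Unidirectional mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(is_masked):
    """Assumed hybrid rule: position i attends to every position j <= i,
    and additionally to future positions j > i whose tokens are unmasked."""
    n = len(is_masked)
    return [[j <= i or not is_masked[j] for j in range(n)] for i in range(n)]


# Three tokens, with the middle one masked out.
uni = causal_mask(3)
bi = bidirectional_mask([False, True, False])
```

In this toy case the first position cannot see the masked middle token but can see the unmasked final token, which is the kind of future context a purely causal mask would hide.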

Paper Structure (19 sections, 5 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: (a) Motion Length Prediction: Text-to-motion models often require a specific motion length as input, which makes generation quality sensitive to the length supplied. In contrast, BAMM automatically predicts the end of the motion, thus avoiding reliance on inaccurate motion length estimations. (b) High-quality Text-to-Motion: BAMM generates natural human movements precisely aligned with detailed textual descriptions. (c) Motion Editing: BAMM is capable of multiple editing tasks, such as inpainting (as demonstrated), outpainting, prefix prediction, suffix completion, and arbitrarily long motion sequence synthesis.
  • Figure 2: Overall architecture of BAMM. (a) The Motion Tokenizer encodes the raw motion sequence into discrete motion tokens according to a learned codebook. (b) The Masked Self-attention Transformer learns to sequentially predict the next tokens, conditioned on the text embedding from the CLIP model and on future unmasked tokens. The masked self-attention mechanism unifies autoregressive modeling and generative masked modeling via bidirectional and unidirectional causal masks.
  • Figure 3: Inference: Dual-iteration Cascaded Motion Decoding. In the first iteration, autoregressive decoding is applied with a unidirectional causal mask to generate coarse-grained motion and predict the motion sequence length. In the second iteration, bidirectional autoregressive decoding is performed with a bidirectional causal mask to remove low-confidence motion tokens and re-predict them autoregressively.
  • Figure 4: Residual Motion Refinement. The residual vector quantization encodes the raw motion sequence into multiple token sequences, shown in different colors (left). The base token sequence from the first vector quantizer is generated via cascaded decoding by the masked self-attention transformer. The base token sequence is used as the input of the refinement transformer to predict the residual token sequences from the other quantizers. The combined sequences are fed into the tokenizer's decoder for motion generation. The refinement transformer shares the same architecture as the masked self-attention transformer, but uses a full attention mask (right).
  • Figure 5: Visual comparison of text-to-motion generation against state-of-the-art methods. BAMM and T2M-GPT do not require motion length as an input. We use a pre-trained length estimator from t2m for MoMask and MDM. BAMM generates higher-quality motion that is more closely aligned with the textual descriptions.
  • ...and 7 more figures
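
The residual vector quantization described in the Figure 4 caption can be illustrated with a small, generic sketch: each quantizer encodes what the previous ones left over, and decoding sums the selected code vectors. This is a textbook-style toy in pure Python under assumed names (`nearest`, `rvq_encode`, `rvq_decode`) and made-up two-dimensional codebooks; it is not the BAMM tokenizer, which operates on learned latent features.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the remaining residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        indices.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code vector from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Hypothetical toy codebooks: a coarse base layer and one finer residual layer.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
tokens = rvq_encode([1.2, 0.8], codebooks)
approx = rvq_decode(tokens, codebooks)
```

The first index plays the role of the base token and the later indices the role of residual tokens. Decoding with only the base layer gives a coarser reconstruction than decoding with both layers.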