BAD: Bidirectional Auto-regressive Diffusion for Text-to-Motion Generation

S. Rohollah Hosseyni, Ali Ahmad Rahmani, S. Jamal Seyedmohammadi, Sanaz Seyedin, Arash Mohammadi

TL;DR

BAD introduces a unified framework that combines autoregressive causality with bidirectional context for text-to-motion generation. It uses a two-stage pipeline: a simple VQ-VAE motion tokenizer discretizes the motion, and a hybrid-attention transformer is then trained to reconstruct the tokens from a permutation-corrupted sequence, conditioned on text. Two inference schemes, OAAS and CBS, enable flexible, efficient sampling by leveraging permuted causal dependencies and confidence-based refinement. Empirical results on HumanML3D and KIT-ML show that BAD substantially improves FID and maintains strong text-motion alignment, while remaining competitive with models built on RVQ-VAE tokenizers and offering faster inference. The approach suggests a versatile pre-training strategy for sequence modeling that could extend to other modalities such as text, audio, and images.

Abstract

Autoregressive models excel at modeling sequential dependencies by enforcing causal constraints, yet they struggle to capture complex bidirectional patterns due to their unidirectional nature. In contrast, mask-based models leverage bidirectional context, enabling richer dependency modeling. However, they often assume token independence during prediction, which undermines the modeling of sequential dependencies. Additionally, the corruption of sequences through masking or absorption can introduce unnatural distortions, complicating the learning process. To address these issues, we propose Bidirectional Autoregressive Diffusion (BAD), a novel approach that unifies the strengths of autoregressive and mask-based generative models. BAD utilizes a permutation-based corruption technique that preserves the natural sequence structure while enforcing causal dependencies through randomized ordering, enabling the effective capture of both sequential and bidirectional relationships. Comprehensive experiments show that BAD outperforms autoregressive and mask-based models in text-to-motion generation, suggesting a novel pre-training strategy for sequence modeling. The codebase for BAD is available at https://github.com/RohollahHS/BAD.

Paper Structure

This paper contains 10 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overall framework of our text-to-motion model. (a) The motion tokenizer transforms a raw 3D motion sequence into a sequence of discrete motion tokens. (b) The conditional mask-based transformer reconstructs the original discrete motion tokens from a corrupted sequence, conditioned on a text prompt.
  • Figure 2: Examples of two different hybrid attention masks. $\mathbf{z}$ represents a random ordering $\mathbf{z} \sim \mathcal{Z}_{T}$, while $t$ denotes time. Each mask token attends to the last $T - p + 1$ mask tokens $\mathbf{m}_{\mathbf{z} \geq p}$ and unmasked tokens. For example, orange cells indicate tokens that the third mask token, $m_{z_{3}}$, can attend to, including unmasked tokens and the existing $\mathbf{m}_{\mathbf{z} \geq 3}$ mask tokens.
  • Figure 3: Quality Comparison. (a) Visualization of generated motions from various models for the same prompt, with red circles indicating defects and green circles highlighting correct, natural motions. (b) Additional motions generated by BAD. (c) Visualization of temporal editing tasks.
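
The attention rule in the Figure 2 caption can be illustrated with a small NumPy sketch. It is not taken from the BAD codebase: the function name `hybrid_attention_mask` and its arguments are hypothetical, and the rows for unmasked queries (which the caption does not specify) are assumed to attend to unmasked tokens only. The part that follows the caption is the rule for mask tokens: the mask token at place $p$ in the random ordering attends to every unmasked token and to the mask tokens at places $\geq p$.

```python
import numpy as np


def hybrid_attention_mask(is_masked, order):
    """Return a boolean matrix where allow[i, j] means query i may attend to key j.

    is_masked: length-n booleans, True where the token is a mask token.
    order:     the masked positions listed in their random order z
               (order[0] is the first mask token, i.e. p = 1).
    Hypothetical helper for illustration; not part of the BAD codebase.
    """
    is_masked = np.asarray(is_masked, dtype=bool)
    order = np.asarray(order, dtype=int)
    n = is_masked.shape[0]
    if sorted(order.tolist()) != np.flatnonzero(is_masked).tolist():
        raise ValueError("order must be a permutation of the masked positions")

    # rank[j] is the 0-based place of position j in the ordering, -1 if unmasked.
    rank = np.full(n, -1)
    rank[order] = np.arange(len(order))

    allow = np.zeros((n, n), dtype=bool)
    # Every query may attend to every unmasked key.
    allow[:, ~is_masked] = True
    # A mask-token query at place p also attends to mask-token keys at places >= p.
    q = rank[:, None]
    k = rank[None, :]
    allow |= (q >= 0) & (k >= 0) & (k >= q)
    # Assumption: unmasked queries get no access to mask tokens (caption is silent).
    return allow


# Five tokens; positions 0, 2 and 3 are masked and visited in the order 3, 0, 2.
mask = hybrid_attention_mask([True, False, True, True, False], order=[3, 0, 2])
print(mask.astype(int))
```

In this toy case the first mask token in the ordering (position 3) sees the whole sequence, while the last one (position 2) sees only itself and the two unmasked tokens. The earlier a mask token is in the ordering, the more of the remaining mask tokens it can attend to.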