Causal Motion Diffusion Models for Autoregressive Motion Generation

Qing Yu, Akihisa Watanabe, Kent Fujiwara

TL;DR

Experiments on HumanML3D and SnapMoGen demonstrate that CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness, while substantially reducing inference latency.

Abstract

Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real-time applicability, or use autoregressive models that suffer from instability and cumulative errors. In this work, we present Causal Motion Diffusion Models (CMDM), a unified framework for autoregressive motion generation based on a causal diffusion transformer that operates in a semantically aligned latent space. CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations. On top of this latent representation, an autoregressive diffusion transformer is trained using causal diffusion forcing to perform temporally ordered denoising across motion frames. To achieve fast inference, we introduce a frame-wise sampling schedule with causal uncertainty, where each subsequent frame is predicted from partially denoised previous frames. The resulting framework supports high-quality text-to-motion generation, streaming synthesis, and long-horizon motion generation at interactive rates. Experiments on HumanML3D and SnapMoGen demonstrate that CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness, while substantially reducing inference latency.
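
The frame-wise sampling schedule mentioned in the abstract can be illustrated with a small sketch. This is not the authors' schedule, whose exact form is not given here; it is one plausible reading in which each frame starts denoising a fixed number of steps after its predecessor, so that earlier frames are always at least as clean as later ones. The function name `causal_schedule` and the parameters `num_levels` and `delay` are assumptions made for illustration.

```python
import numpy as np

def causal_schedule(num_frames: int, num_levels: int, delay: int) -> np.ndarray:
    """Hypothetical staggered schedule (an assumption, not the paper's definition).

    Returns an integer array of shape (steps, num_frames). Entry [s, f] is the
    noise level of frame f at sampling step s, where num_levels means pure
    noise and 0 means fully denoised. Frame f begins denoising `delay` steps
    after frame f - 1.
    """
    total_steps = num_levels + delay * (num_frames - 1)
    steps = np.arange(total_steps + 1)[:, None]       # shape (S, 1)
    offsets = delay * np.arange(num_frames)[None, :]  # shape (1, F)
    levels = num_levels - (steps - offsets)           # linear ramp per frame
    return np.clip(levels, 0, num_levels)

# Example: 4 frames, 3 noise levels, each frame lagging its predecessor by 1 step.
schedule = causal_schedule(num_frames=4, num_levels=3, delay=1)
print(schedule)
```

In this toy setting every row is non-decreasing from left to right, so a frame is always conditioned on predecessors that are no noisier than itself. The whole sequence finishes in `num_levels + delay * (num_frames - 1)` steps, compared with `num_levels * num_frames` steps if each frame were fully denoised before the next one started.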

Paper Structure (43 sections, 13 equations, 7 figures, 11 tables)

Figures (7)

  • Figure 1: Overview of the existing methods and the proposed method. Existing diffusion-based methods (left) perform full-sequence denoising using the same noise level across all frames. In contrast, our proposed CMDM (right) introduces a causal diffusion forcing mechanism that operates on semantic causal latent features with frame-wise noise levels.
  • Figure 2: Overview of the proposed CMDM framework. CMDM consists of three key components: (a) MAC-VAE, which encodes motion sequences into motion–language–aligned and temporally causal latent features using a causal encoder–decoder structure supervised by motion-language model alignment; (b) Causal-DiT, which performs diffusion denoising with causal self-attention and cross-attention to text embeddings, ensuring temporally ordered and semantically consistent frame refinement; and (c) Causal Diffusion Forcing, which applies independent frame-level noise during training and a causal uncertainty schedule during inference, where the redness intensity represents the noise level. This design enables CMDM to achieve temporally consistent, semantically aligned, and efficient text-to-motion generation suitable for streaming and long-horizon synthesis.
  • Figure 3: Qualitative results of long-horizon motion generation. Comparison between our CMDM and previous methods. The generated motion is continuous and seamless; for visualization purposes, we split each long sequence into shorter segments corresponding to their captions. Please refer to the videos in the supplementary materials for the complete motion sequences.
  • Figure 4: Qualitative results of long-horizon motion generation on HumanML3D. Comparison between our CMDM and previous methods. The generated motion is continuous and seamless; for visualization purposes, we split each long sequence into shorter segments corresponding to their captions. Please refer to the videos in the supplementary materials for the complete motion sequences.
  • Figure 5: Qualitative results of long-horizon motion generation on SnapMoGen. Comparison between our CMDM and previous methods. The generated motion is continuous and seamless; for visualization purposes, we split each long sequence into shorter segments corresponding to their captions. Please refer to the videos in the supplementary materials for the complete motion sequences.
  • ...and 2 more figures
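
The captions of Figures 1 and 2 contrast a single shared noise level with independent frame-level noise and describe causal self-attention. The sketch below illustrates both ideas under stated assumptions: it uses the standard DDPM forward process with a linear beta schedule, which may differ from the parameterization used in the paper, and the helper names `add_framewise_noise` and `causal_mask` are hypothetical.

```python
import numpy as np

def add_framewise_noise(latents, alphas_bar, rng):
    """Noise each frame at its own independently sampled diffusion step.

    latents:    array of shape (T, D), one latent vector per frame.
    alphas_bar: array of shape (K,), cumulative signal-retention coefficients
                (standard DDPM notation, assumed here).
    Returns the noisy latents, the per-frame step indices, and the noise.
    """
    num_frames = latents.shape[0]
    t = rng.integers(0, len(alphas_bar), size=num_frames)  # one step per frame
    a = alphas_bar[t][:, None]                             # shape (T, 1)
    eps = rng.standard_normal(latents.shape)
    noisy = np.sqrt(a) * latents + np.sqrt(1.0 - a) * eps
    return noisy, t, eps

def causal_mask(num_frames):
    """Boolean mask where entry [i, j] is True if frame i may attend to frame j."""
    return np.tril(np.ones((num_frames, num_frames), dtype=bool))

# Toy usage: 8 frames of 4-dimensional latents, 1000 diffusion steps.
rng = np.random.default_rng(0)
alphas_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
latents = rng.standard_normal((8, 4))
noisy, t, eps = add_framewise_noise(latents, alphas_bar, rng)
mask = causal_mask(8)
```

A full-sequence diffusion model would instead draw one step index and broadcast it to all frames. Sampling `t` per frame is what lets a denoiser see histories at mixed noise levels during training, and the lower-triangular mask keeps each frame from attending to later frames.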