MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space

Lixing Xiao, Shunlin Lu, Huaijin Pi, Ke Fan, Liang Pan, Yueer Zhou, Ziyong Feng, Xiaowei Zhou, Sida Peng, Jingbo Wang

TL;DR

MotionStreamer tackles text-conditioned streaming motion generation by first encoding motions into a continuous causal latent space via a Causal Temporal AutoEncoder, then employing a diffusion-headed autoregressive transformer that conditions on text and past latents. Two training strategies (Two-Forward and Mixed) mitigate exposure bias and enable robust long-horizon generation, while a continuous stopping condition via an impossible end latent supports automatic online termination. The approach achieves state-of-the-art results on HumanML3D T2M and BABEL and supports multi-round, long-term, and dynamic motion composition with improved online latency. The work introduces a practical, scalable framework for real-time, text-driven character animation with continuous latents, avoiding information loss from discretization and enabling online decoding.

Abstract

This paper addresses the challenge of text-conditioned streaming motion generation, which requires predicting the next-step human pose from variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation: diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. Project Page: https://zju3dv.github.io/MotionStreamer/
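The streaming behaviour described in the abstract and TL;DR can be illustrated with a minimal sketch: a loop that asks a predictor for the next latent given the text and all past latents, yields it immediately, and stops once the prediction lands on a reserved end latent. Everything named here is hypothetical and for illustration only: `stream_latents`, `toy_predictor`, the L2-distance stop test, and the threshold value are assumptions, and the toy predictor merely stands in for the diffusion-headed autoregressive transformer.

```python
import math
from typing import Callable, Iterator, List, Sequence

Latent = List[float]
# Hypothetical signature: (text, past latents) -> next latent.
Predictor = Callable[[str, Sequence[Latent]], Latent]


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two latents of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def stream_latents(
    predict_next: Predictor,
    text: str,
    end_latent: Latent,
    stop_threshold: float = 1e-3,  # assumed value, not from the paper
    max_steps: int = 1000,
) -> Iterator[Latent]:
    """Yield latents one at a time until the end latent is predicted."""
    history: List[Latent] = []
    for _ in range(max_steps):
        z = predict_next(text, history)
        # Assumed stop test: the prediction is close to the end latent.
        if l2_distance(z, end_latent) < stop_threshold:
            return
        history.append(z)
        yield z  # a decoder could turn this latent into frames right away


# An "impossible" end latent: the toy latents below never leave [-1, 1].
END_LATENT: Latent = [10.0, 10.0]


def toy_predictor(text: str, history: Sequence[Latent]) -> Latent:
    """Stand-in model: emits one latent per word, then the end latent."""
    if len(history) >= len(text.split()):
        return list(END_LATENT)
    step = len(history)
    return [math.sin(step), math.cos(step)]


latents = list(stream_latents(toy_predictor, "a person walks forward", END_LATENT))
print(len(latents))  # prints 4
```

Because the generator yields each latent as soon as it is produced, a consumer can start decoding before the sequence is complete, which is the property that makes the first-frame latency independent of the total motion length in this sketch.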

Paper Structure

This paper contains 19 sections, 9 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Visualization of the streaming motion generation process. Text prompts are input incrementally and motions are generated online.
  • Figure 2: Overview of MotionStreamer. During inference, the AR model predicts the next motion latent in a streaming fashion, conditioned on the current text and previous motion latents. Each latent can be decoded into motion frames online as soon as it is generated.
  • Figure 3: Architecture of Causal TAE. 1D temporal causal convolution is applied in both the encoder and decoder. Variables $z_{1:n}$ are sampled as continuous motion latent representations.
  • Figure 4: Comparison of first-frame latency across methods. Horizontal axis: the number of generated frames. Vertical axis: the time required to produce the first output frame.
  • Figure 5: Visualization results comparing our method with baseline methods (T2M-GPT, MoMask, AttT2M, FlowMDM). The first row shows text-to-motion generation results, the second row shows long-term generation results, and the third row shows the application of dynamic motion composition.
  • ...and 5 more figures
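
The Figure 3 caption mentions 1D temporal causal convolution. As a rough illustration of what "causal" means there, the sketch below pads the input on the left only, so each output step depends on the current and earlier inputs and never on later ones. This is a generic single-channel example in plain Python, not the Causal TAE itself; the function name `causal_conv1d`, the kernel values, and the test signal are assumptions chosen for the demonstration.

```python
from typing import List, Sequence


def causal_conv1d(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Single-channel 1D convolution with left-only zero padding.

    Output step t is a weighted sum of x[t - k + 1 .. t], where k is the
    kernel length, so no future input can influence it.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)  # pad the past, never the future
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [0.25, 0.25, 0.5]  # illustrative weights; the last one hits x[t]

baseline = causal_conv1d(signal, kernel)
# Change only the last two inputs and recompute.
perturbed = causal_conv1d(signal[:3] + [100.0, -100.0], kernel)

# The first three outputs are unaffected by the later inputs.
print(baseline[:3] == perturbed[:3])  # prints True
```

The same left-padding idea is what lets a causal encoder or decoder process a motion sequence incrementally: outputs already emitted stay valid when new frames arrive.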