FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

Yiyi Cai, Yuhan Wu, Kunhang Li, You Zhou, Bo Zheng, Haiyang Liu

TL;DR

FloodDiffusion tailors diffusion forcing to streaming, text-driven motion generation, addressing latency and mid-stream prompt changes. It combines a vectorized, lower-triangular time schedule, bi-directional attention, and time-varying text conditioning on top of a latent diffusion backbone with a causal VAE and a DiT-style denoiser. The approach achieves state-of-the-art streaming performance (FID 0.057 on HumanML3D) while remaining competitive with non-streaming methods, and its ablations indicate that each design choice is critical. These results suggest that diffusion forcing can deliver high-quality, real-time motion generation with theoretical guarantees on modeling the output distribution.

Abstract

We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency. Unlike existing methods that rely on chunk-by-chunk generation or auto-regressive models with a diffusion head, we adopt a diffusion forcing framework to model this time-series generation task under time-varying control events. We find that a straightforward implementation of vanilla diffusion forcing (as proposed for video models) fails to model real motion distributions. We demonstrate that, to guarantee modeling the output distribution, vanilla diffusion forcing must be tailored to: (i) train with bi-directional attention instead of causal attention; (ii) use a lower-triangular time scheduler instead of a random one; (iii) introduce text conditioning in a continuous, time-varying way. With these improvements, we demonstrate for the first time that a diffusion forcing-based framework achieves state-of-the-art performance on the streaming motion generation task, reaching an FID of 0.057 on the HumanML3D benchmark. Models, code, and weights are available. https://shandaai.github.io/FloodDiffusion/

Paper Structure

This paper contains 40 sections, 8 theorems, 47 equations, 7 figures, 6 tables, 2 algorithms.

Key Result

Proposition 3.2

Given the vectorized time schedule, the conditional vector field and score function are: where $\odot$ denotes element-wise multiplication and the division is also element-wise.
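
The expressions of the proposition are not reproduced above. As a hedged illustration only, here is what such closed forms look like under a standard Gaussian probability path. The path below, and the symbols $\boldsymbol{\alpha}_t$, $\boldsymbol{\beta}_t$, $z$, and $\epsilon$, are assumptions made for this sketch and may differ from the paper's notation. Each frame has its own entry in $\boldsymbol{\alpha}_t$ and $\boldsymbol{\beta}_t$, $z$ is the clean latent sequence, and $\epsilon \sim \mathcal{N}(0, I)$.

$$
x_t = \boldsymbol{\alpha}_t \odot z + \boldsymbol{\beta}_t \odot \epsilon
$$

Differentiating with respect to $t$ and substituting $\epsilon = (x_t - \boldsymbol{\alpha}_t \odot z) / \boldsymbol{\beta}_t$ gives the conditional vector field. The conditional score follows from the Gaussian density $\mathcal{N}(\boldsymbol{\alpha}_t \odot z, \operatorname{diag}(\boldsymbol{\beta}_t^2))$:

$$
u_t(x \mid z) = \left( \dot{\boldsymbol{\alpha}}_t - \frac{\dot{\boldsymbol{\beta}}_t}{\boldsymbol{\beta}_t} \odot \boldsymbol{\alpha}_t \right) \odot z + \frac{\dot{\boldsymbol{\beta}}_t}{\boldsymbol{\beta}_t} \odot x,
\qquad
\nabla_x \log p_t(x \mid z) = - \frac{x - \boldsymbol{\alpha}_t \odot z}{\boldsymbol{\beta}_t^{2}}
$$

Both expressions are defined only for frames whose entry of $\boldsymbol{\beta}_t$ is nonzero, that is, frames that still carry noise.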

Figures (7)

  • Figure 1: FloodDiffusion is a diffusion-forcing-based framework for streaming human motion generation. Given time-varying text prompts, such as "raise knees" followed by "squats", it generates smooth, continuous human motions aligned with the text. The framework natively handles prompt changes and does not require inference-time optimizations like explicit prompt refresh detection.
  • Figure 2: Pipeline Overview. FloodDiffusion is a latent-diffusion-based framework: the $263D$ motion stream is encoded into a compact $4D$ latent sequence by our causal VAE. The model then predicts the latent velocity, $\hat{u}_t$, for the active window $m(t){:}n(t)$, conditioned on the context $0{:}n(t)$. The key designs are adding noise to the sequence according to a lower-triangular time schedule and frame-wise text conditioning using an attention mask. During inference, we start from noise and slide the window, generating latent frames that are immediately decoded for streaming output.
  • Figure 3: Noise Schedule Comparison. Diffusion forcing samples a random schedule, which leaves the active window uncertain and creates a mismatch between the training and test schedules. Chunk diffusion denoises all frames within each chunk uniformly, incurring high response latency. Our triangular schedule denoises only the active window and advances at a constant per-frame rate.
  • Figure 4: Comparison of time-varying conditioning. Our model generates different resulting motions from the same text prompts based on their delivery timing. (Top Left) Prompts are given separately at different frames. (Top Right) All conditions are fed as a single prompt at once. (Bottom Left) Two separate prompts are input early in the sequence. (Bottom Right) The same two separate prompts are input later in the sequence.
  • Figure 5: Comparison of long sequence generation. (Left) Our model keeps repeating the motion described by the text prompt if no new prompt arrives. (Right) In real applications, the current motion can be stopped by explicitly giving a rest-style prompt, such as "stand".
  • ...and 2 more figures
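
The frame-wise text conditioning mentioned in the Figure 2 caption can be pictured as a boolean cross-attention mask between motion frames and text tokens. The sketch below is a minimal illustration, not the authors' implementation. It assumes that all prompts are concatenated into one token sequence and that each frame attends only to the most recent prompt delivered at or before that frame. The function name `framewise_text_mask` and its parameters are hypothetical.

```python
def framewise_text_mask(prompt_starts, prompt_lengths, num_frames):
    """Build mask[i][j]: True if motion frame i may attend to text token j.

    Assumptions for this sketch (not taken from the paper):
      * prompt_starts is sorted ascending; prompt k is delivered at frame
        prompt_starts[k] and stays active until the next prompt arrives.
      * prompt_lengths[k] is the token count of prompt k, and all prompts
        are concatenated into one token sequence in delivery order.
    """
    # Offset of each prompt inside the concatenated token sequence.
    offsets = []
    position = 0
    for length in prompt_lengths:
        offsets.append(position)
        position += length
    total_tokens = position

    mask = [[False] * total_tokens for _ in range(num_frames)]
    for frame in range(num_frames):
        # Find the latest prompt that has already been delivered.
        active = None
        for k, start in enumerate(prompt_starts):
            if start <= frame:
                active = k
        if active is None:
            continue  # no prompt yet: the frame sees no text tokens
        begin = offsets[active]
        for token in range(begin, begin + prompt_lengths[active]):
            mask[frame][token] = True
    return mask


# Example: "raise knees" (2 tokens) from frame 0, "squats" (1 token) from frame 3.
mask = framewise_text_mask([0, 3], [2, 1], num_frames=5)
for row in mask:
    print(row)
```

In an attention layer, a mask of this shape would typically be passed as the cross-attention mask, so that changing the prompt only affects frames from the delivery point onward.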

Theorems & Definitions (19)

  • Definition 3.1: Vectorized Time Schedule
  • Proposition 3.2: Vectorized Conditional Dynamics
  • Definition 3.3: Marginal Dynamics
  • Theorem 3.4: Conditional Generation
  • Definition 3.5: Active Window
  • Lemma 3.6: Schedule Saturation
  • Theorem 3.8: Streaming Locality
  • Remark 3.9: Why the triangle matters
  • Remark 3.10: Bidirectional attention is important
  • Proposition B.1: Vectorized Conditional Dynamics
  • ...and 9 more
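
The vectorized time schedule and active window named in the list above, together with the Figure 3 description of a schedule that "advances at a constant per-frame rate", can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's definition. Diffusion time 1.0 is taken to mean a clean frame and 0.0 pure noise, the stream position `t` is measured in frames, and `window` is a hypothetical parameter giving the number of frames a frame spends being denoised.

```python
def triangular_schedule(t, num_frames, window):
    """Per-frame diffusion time at stream position t (in frames).

    Assumed convention: 1.0 = fully denoised, 0.0 = pure noise.
    Frame i starts denoising when t passes i and is clean at t = i + window,
    so adjacent frames always differ by 1 / window inside the active window.
    """
    return [min(1.0, max(0.0, (t - i) / window)) for i in range(num_frames)]


def active_window(t, num_frames, window):
    """Half-open range (m, n) of frames that are partially denoised at t.

    Returns None when no frame is strictly between noise and clean.
    """
    tau = triangular_schedule(t, num_frames, window)
    active = [i for i, value in enumerate(tau) if 0.0 < value < 1.0]
    if not active:
        return None
    return active[0], active[-1] + 1


# Advancing t by one frame shifts the whole window by one frame.
for t in (5.5, 6.5):
    print(t, triangular_schedule(t, 10, 4), active_window(t, 10, 4))
```

Under these assumptions, frames before the window are clean context, frames after it are still pure noise, and only the frames inside the window need a denoising update at each step.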