FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

Yiyi Cai; Yuhan Wu; Kunhang Li; You Zhou; Bo Zheng; Haiyang Liu

TL;DR

FloodDiffusion tailors diffusion forcing to streaming, text-driven motion generation, targeting low latency and prompts that change over time. It combines a vectorized, lower-triangular time schedule, bi-directional attention, and time-varying text conditioning on top of a latent diffusion backbone with a causal VAE and a DiT-style denoiser. The approach reaches state-of-the-art streaming performance (FID 0.057 on HumanML3D) while remaining competitive with non-streaming methods, and its ablations indicate that each of these design choices is critical. These results suggest that diffusion forcing can deliver high-quality, real-time motion generation while still modeling the target motion distribution.

Abstract

We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency. Unlike existing methods that rely on chunk-by-chunk generation or on auto-regressive models with a diffusion head, we adopt a diffusion forcing framework to model this time-series generation task under time-varying control events. We find that a straightforward implementation of vanilla diffusion forcing (as proposed for video models) fails to model real motion distributions. We demonstrate that, to guarantee modeling of the output distribution, vanilla diffusion forcing must be tailored to: (i) train with bi-directional attention instead of causal attention; (ii) use a lower-triangular time scheduler instead of a random one; (iii) introduce text conditioning in a continuous, time-varying manner. With these improvements, we demonstrate for the first time that a diffusion forcing-based framework achieves state-of-the-art performance on the streaming motion generation task, reaching an FID of 0.057 on the HumanML3D benchmark. Models, code, and weights are available at https://shandaai.github.io/FloodDiffusion/
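
To make the contrast in point (ii) more concrete, the sketch below shows one plausible reading of a lower-triangular time schedule. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `window` parameter, and the linear ramp are all hypothetical. The idea shown is that, at each streaming step, frames already emitted are fully clean, frames far in the future are pure noise, and a fixed-width band in between is partially denoised, so that the clean region forms a lower triangle when steps are laid out as rows and frames as columns.

```python
# Hypothetical sketch of a lower-triangular noise schedule for streaming
# generation. All names and the linear ramp are assumptions made for
# illustration; the abstract does not specify the exact schedule.

def triangular_noise_levels(step, num_frames, window):
    """Return per-frame noise levels in [0, 1] at one streaming step.

    0.0 means fully clean, 1.0 means pure noise. Frames at or before
    `step` are clean; frames `window` or more ahead are pure noise;
    frames in between ramp linearly.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    levels = []
    for frame in range(num_frames):
        raw = (frame - step) / window
        levels.append(min(1.0, max(0.0, raw)))  # clamp to [0, 1]
    return levels


def schedule_matrix(num_steps, num_frames, window):
    """Stack the per-step noise levels: rows are steps, columns are frames."""
    return [triangular_noise_levels(s, num_frames, window)
            for s in range(num_steps)]


if __name__ == "__main__":
    # Print a small schedule; zeros fill the lower-left triangle.
    for row in schedule_matrix(num_steps=4, num_frames=6, window=3):
        print(" ".join(f"{v:.2f}" for v in row))
```

A random scheduler, by contrast, would draw each frame's noise level independently, so a later frame could be cleaner than an earlier one; in this sketch the noise level never decreases along the frame axis.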
