PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition

Zeyu Ling; Qing Shuai; Teng Zhang; Shiyang Li; Bo Han; Changqing Zou

TL;DR

PRISM trains a single motion generation foundation model that handles text-to-motion, pose-conditioned generation, autoregressive sequential generation, and narrative motion composition, achieving state-of-the-art results on HumanML3D, MotionHub, and BABEL, and in a 50-scenario user study.

Abstract

Text-to-motion generation has advanced rapidly, yet two challenges persist. First, existing motion autoencoders compress each frame into a single monolithic latent vector, entangling trajectory and per-joint rotations in an unstructured representation that downstream generators struggle to model faithfully. Second, text-to-motion, pose-conditioned generation, and long-horizon sequential synthesis typically require separate models or task-specific mechanisms, with autoregressive approaches suffering from severe error accumulation over extended rollouts. We present PRISM, addressing each challenge with a dedicated contribution. (1) A joint-factorized motion latent space: each body joint occupies its own token, forming a structured 2D grid (time $\times$ joints) compressed by a causal VAE with forward-kinematics supervision. This simple change to the latent space -- without modifying the generator -- substantially improves generation quality, revealing that latent space design has been an underestimated bottleneck. (2) Noise-free condition injection: each latent token carries its own timestep embedding, allowing conditioning frames to be injected as clean tokens (timestep 0) while the remaining tokens are denoised. This unifies text-to-motion and pose-conditioned generation in a single model, and directly enables autoregressive segment chaining for streaming synthesis. Self-forcing training further suppresses drift in long rollouts. With these two components, we train a single motion generation foundation model that seamlessly handles text-to-motion, pose-conditioned generation, autoregressive sequential generation, and narrative motion composition, achieving state-of-the-art on HumanML3D, MotionHub, BABEL, and a 50-scenario user study.
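
The noise-free condition injection described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `noise_tokens`, the latent shape `(T, J, D)`, the frame-level conditioning mask, and the linear interpolation path (timestep 0 is clean data, timestep 1 is pure noise) are all assumptions made for illustration; the abstract only states that each token carries its own timestep and that conditioning frames enter as clean tokens at timestep 0.

```python
import numpy as np

def noise_tokens(latents, cond_mask, t, rng):
    """Illustrative per-token noising for a (time x joints) latent grid.

    latents:   (T, J, D) clean latent tokens (hypothetical shape).
    cond_mask: (T,) bool, True for frames used as conditioning.
    t:         scalar in [0, 1]; assumed convention: 0 = clean, 1 = noise.
    Returns the partially noised grid and the (T, J) per-token timesteps.
    """
    T, J, _ = latents.shape
    # Conditioning frames are pinned to timestep 0; all other tokens share t.
    t_grid = np.where(cond_mask[:, None], 0.0, t) * np.ones((T, J))
    noise = rng.standard_normal(latents.shape)
    # Assumed linear path between data and noise, applied token by token.
    w = t_grid[..., None]
    noisy = (1.0 - w) * latents + w * noise
    return noisy, t_grid

rng = np.random.default_rng(0)
latents = rng.standard_normal((8, 4, 16))   # 8 frames, 4 joints, 16 channels
cond_mask = np.zeros(8, dtype=bool)
cond_mask[:2] = True                        # first two frames act as the condition
noisy, t_grid = noise_tokens(latents, cond_mask, 0.7, rng)
```

Under these assumptions the conditioning frames pass through unchanged, and a generator that receives `t_grid` as a per-token timestep input can tell clean tokens from noisy ones. Chaining segments autoregressively would then amount to reusing the last generated frames as the next segment's conditioning frames.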

Paper Structure (42 sections, 2 equations, 4 figures, 9 tables)

Introduction
Related Work
Text-to-motion generation.
Motion latent representations.
Long-horizon and streaming motion generation.
Unified motion generation models.
Method
Causal Joint-Factorized VAE
Joint-factorized input.
Encoder--decoder architecture.
$xz$-plane data augmentation.
Forward-kinematics (FK) supervision.
Latent Flow-Matching DiT with Noise-Free Condition Injection
Flow-matching training.
Noise-free condition injection.
...and 27 more sections

Figures (4)

Figure 1: A single PRISM model handles text-to-motion, pose-conditioned generation, sequential synthesis, and narrative motion composition, producing coherent streaming motion with 10+ segments far beyond the training horizon.
Figure 2: Overview of PRISM. (a) A causal joint-factorized VAE compresses per-joint SMPL tokens into a structured 2D latent grid. (b) A flow-matching DiT denoises the grid with per-token timestep embeddings, unifying T2M, pose-conditioned generation, and autoregressive streaming via noise-free condition injection. Self-forcing suppresses drift over long rollouts.
Figure 3: Qualitative comparison on text-to-motion. PRISM produces smoother, physically plausible motions with less jitter and foot sliding than baselines.
Figure 4: Qualitative comparison on long-horizon narrative composition. PRISM follows all sub-actions with smooth transitions, while MotionStreamer misses several actions and exhibits drift.
