
Unified Number-Free Text-to-Motion Generation Via Flow Matching

Guanhe Huang, Oya Celiktutan

Abstract

Generative models excel at motion synthesis for a fixed number of agents but struggle to generalize to a variable number of agents. Trained on limited, domain-specific data, existing methods employ autoregressive models that generate motion recursively, suffering from inefficiency and error accumulation. We propose Unified Motion Flow (UMF), which consists of Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow). UMF decomposes number-free motion generation into a single-pass motion prior generation stage and multi-pass reaction generation stages. Specifically, UMF utilizes a unified latent space to bridge the distribution gap between heterogeneous motion datasets, enabling effective unified training. For motion prior generation, P-Flow operates on hierarchical resolutions conditioned on different noise levels, thereby mitigating computational overheads. For reaction generation, S-Flow learns a joint probabilistic path that adaptively performs reaction transformation and context reconstruction, alleviating error accumulation. Extensive results and user studies demonstrate UMF's effectiveness as a generalist model for multi-person motion generation from text. Project page: https://githubhgh.github.io/umf/.
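For context, both P-Flow and S-Flow build on the standard (conditional) flow-matching formulation; the objective below is general background rather than an equation taken from this paper. A velocity field $v_\theta$ is trained to transport a noise latent $Z_0$ toward a data latent $Z_1$ along the linear interpolation path:

$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,Z_0,\,Z_1}\left\| v_\theta(Z_t, t) - (Z_1 - Z_0) \right\|^2, \qquad Z_t = (1-t)\,Z_0 + t\,Z_1,$$

where $t \in (0,1)$, $Z_0$ is sampled noise, and $Z_1$ is a latent motion from the unified VAE space. Sampling then integrates the learned ODE $\dot{Z}_t = v_\theta(Z_t, t)$ from $t=0$ to $t=1$.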

Paper Structure

This paper contains 17 sections, 16 equations, 4 figures, 5 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Core contribution of UMF. We show dual-agent cases here for simplicity. (a) Standard methods tevet2022human; wang2025timotion are restricted to a fixed number of agents. (b) Autoregressive methods fan2025freemotion decouple generation into a motion prior and a subsequent reaction. The reaction is typically guided by the prior via a conditioning network. (c) Our UMF leverages a heterogeneous motion prior as the adaptive starting point of the reaction flow path, mitigating error accumulation.
  • Figure 2: Overview of the Unified Motion Flow (UMF) architecture. The UMF framework consists of three stages. (A) Unified motion VAE: A motion VAE with latent adapters encodes raw motions from heterogeneous datasets (e.g., HumanML3D guo2022generating, InterHuman liang2024intergen) into a regularized multi-token latent representation ($Z$). (B) P-Flow motion prior generation: The Pyramid Flow Transformer synthesizes the latent motion prior ($\check{Z}$) from noisy latent motion and text conditions. P-Flow operates hierarchically over the timestep $t \in (0, 1)$: it processes downsampled, low-resolution latents for $t<p$ and switches to original-resolution latents for $t>p$, mitigating multi-token computational overheads. (C) S-Flow reaction generation: From the previously generated latents $\{\check{Z}_i, \dots, \check{Z}_1\}$, the context adapter produces the context motion $C$. The Semi-Noise Flow Transformer then predicts the reaction latent ($\check{W}$) by jointly modeling context reconstruction and reaction transformation, alleviating error accumulation from previously generated motion.
  • Figure 3: Qualitative comparison (zoom in to see details) between FreeMotion fan2025freemotion and UMF. Red circles indicate successful cases, while blue circles indicate failure cases.
  • Figure 4: The UMF number-free zero-shot generation user study. We asked users to compare our UMF (blue bars) to FreeMotion (red bars) in a side-by-side view. The dashed line marks 50%. UMF outperforms FreeMotion in all three aspects of generation.
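The coarse-to-fine schedule described for P-Flow in Figure 2 (low-resolution latents for $t<p$, original resolution for $t>p$) can be sketched as a resolution switch inside an Euler sampler of the flow ODE. This is a minimal illustrative sketch, not the paper's implementation: `velocity_model`, `downsample`, `upsample`, and the toy velocity field are all hypothetical stand-ins for the Pyramid Flow Transformer and its latent resampling.

```python
import numpy as np

def velocity_model(z, t):
    # Hypothetical stand-in for the Pyramid Flow Transformer:
    # a toy velocity field that pulls latents toward zero.
    return -z * (1.0 - t)

def downsample(z):
    # Average-pool adjacent token pairs along the token axis.
    return z.reshape(z.shape[0] // 2, 2, z.shape[1]).mean(axis=1)

def upsample(z):
    # Nearest-neighbour repeat to restore the token count.
    return np.repeat(z, 2, axis=0)

def p_flow_sample(z_noise, p=0.5, steps=10):
    """Euler integration of the flow ODE with a resolution switch at t = p.

    For t < p the latent is integrated at half token resolution; once t
    reaches p it is upsampled and integration continues at full resolution,
    mirroring the coarse-to-fine schedule described for P-Flow.
    """
    dt = 1.0 / steps
    z = downsample(z_noise)          # start at the coarse resolution
    switched = False
    for i in range(steps):
        t = i * dt
        if not switched and t >= p:  # switch to full resolution once
            z = upsample(z)
            switched = True
        z = z + dt * velocity_model(z, t)
    if not switched:                 # edge case: p >= 1
        z = upsample(z)
    return z

z0 = np.random.default_rng(0).normal(size=(8, 4))  # 8 tokens, 4-dim latent
z1 = p_flow_sample(z0)
print(z1.shape)  # (8, 4)
```

The point of the sketch is the cost argument: roughly the first $p$ fraction of integration steps run on half as many tokens, which is where the caption's claim of reduced multi-token overhead comes from.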