FreeMotion: A Unified Framework for Number-free Text-to-Motion Synthesis

Ke Fan, Junshu Tang, Weijian Cao, Ran Yi, Moran Li, Jingyu Gong, Jiangning Zhang, Yabiao Wang, Chengjie Wang, Lizhuang Ma

TL;DR

FreeMotion addresses the universality gap in text-to-motion by factorizing the joint multi-person motion distribution into conditional single-person motions, enabling number-free synthesis via recursive generation: $p(\boldsymbol{x}^1,...,\boldsymbol{x}^n)=p(\boldsymbol{x}^1)\prod_{i=1}^{n-1}p(\boldsymbol{x}^{i+1}|\boldsymbol{x}^1,...,\boldsymbol{x}^i)$. It introduces a decoupled Generation Module for single-motion synthesis and an Interaction Module to inject conditioning from other individuals, augmented by both explicit and implicit spatial guidance for multi-person control. Training proceeds in two stages, using an LLM to create per-person prompts and a ControlNet-inspired conditioning scheme that preserves generation quality while enabling interactions. On InterHuman, FreeMotion achieves state-of-the-art results across multiple metrics for both single- and multi-person scenarios, and demonstrates coherent three-person motion and robust spatial control, highlighting practical applicability to arbitrary numbers of agents.
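The chain-rule factorization above suggests a simple sampling loop: draw the first motion on its own, then draw each further motion conditioned on everything generated so far. The following is a minimal sketch of that loop only, not the paper's model. The names `sample_single`, `sample_conditional`, and `generate_motions` are hypothetical, and both samplers are toy Gaussian stand-ins that operate on 1-D trajectories rather than real motion features.

```python
import random


def sample_single(rng, frames):
    """Toy stand-in for p(x^1): an unconditioned 1-D trajectory (hypothetical)."""
    return [rng.gauss(0.0, 1.0) for _ in range(frames)]


def sample_conditional(rng, previous, frames):
    """Toy stand-in for p(x^{i+1} | x^1, ..., x^i) (hypothetical).

    The new trajectory is drawn close to the per-frame mean of all
    previously generated trajectories, so it depends on every one of them.
    """
    context = [sum(m[t] for m in previous) / len(previous) for t in range(frames)]
    return [c + rng.gauss(0.0, 0.1) for c in context]


def generate_motions(n, frames=8, seed=0):
    """Recursive generation: one motion at a time, each conditioned on the rest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    motions = [sample_single(rng, frames)]  # first factor: p(x^1)
    for _ in range(n - 1):  # remaining factors: p(x^{i+1} | x^1..x^i)
        motions.append(sample_conditional(rng, motions, frames))
    return motions


if __name__ == "__main__":
    for count in (1, 2, 4):
        print(count, "person(s):", len(generate_motions(count)), "trajectories")
```

Because the loop never fixes `n` in advance, the same two samplers cover one person, two people, or more, which is the sense in which the factorization is number-free.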

Abstract

Text-to-motion synthesis is a crucial task in computer vision. Existing methods are limited in their universality: they are tailored to single-person or two-person scenarios and cannot be applied to generate motions for more individuals. To achieve number-free motion synthesis, this paper reconsiders motion generation and proposes to unify single- and multi-person motion through the conditional motion distribution. Furthermore, a generation module and an interaction module are designed for our FreeMotion framework to decouple the process of conditional motion generation and ultimately support number-free motion synthesis. In addition, based on our framework, current single-person spatial control methods can be seamlessly integrated, achieving precise control of multi-person motion. Extensive experiments demonstrate the superior performance of our method and its capability to infer single- and multi-human motions simultaneously.
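The TL;DR describes the conditioning scheme as ControlNet-inspired and says it preserves generation quality. One common way to get that property, used by ControlNet itself, is to add the conditioning branch through zero-initialized layers, so the combined model initially behaves exactly like the base generator. The sketch below illustrates that idea under stated assumptions: whether FreeMotion uses zero initialization in this form is not confirmed by the text here, and `generation_block`, `InteractionBranch`, and `conditioned_block` are hypothetical names with toy arithmetic in place of real network layers.

```python
def generation_block(x):
    """Toy stand-in for a frozen generation-module layer (hypothetical)."""
    return [2.0 * v + 1.0 for v in x]


class InteractionBranch:
    """Toy stand-in for an interaction-module layer (hypothetical).

    Assumption: weights and bias start at zero, as in ControlNet's
    zero-initialized layers, so the branch contributes nothing at first.
    """

    def __init__(self, dim):
        self.weights = [[0.0] * dim for _ in range(dim)]
        self.bias = [0.0] * dim

    def __call__(self, condition):
        # Plain linear map: weights @ condition + bias.
        return [
            sum(w * c for w, c in zip(row, condition)) + b
            for row, b in zip(self.weights, self.bias)
        ]


def conditioned_block(x, condition, branch):
    """Base output plus the residual injected from the other person's motion."""
    base = generation_block(x)
    residual = branch(condition)
    return [b + r for b, r in zip(base, residual)]


if __name__ == "__main__":
    branch = InteractionBranch(dim=3)
    x = [1.0, 2.0, 3.0]
    other = [5.0, -1.0, 0.5]
    # With zero-initialized weights the conditioned output equals the base output.
    print(conditioned_block(x, other, branch) == generation_block(x))
```

As the branch weights move away from zero during training, the other person's motion begins to influence the output, while the starting point remains the unmodified single-person generator.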

Paper Structure

This paper contains 21 sections, 5 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Left: our model can generate controllable motions for any number of individuals (1--4 in the figure). Different colors represent different people's motions. Right: an illustration of our new paradigm of motion generation, recursive generation, in which every single motion is predicted conditioned on the motions generated before it. Best viewed in color.
  • Figure 2: Overall architecture of FreeMotion, which contains a generation module and an interaction module. Given a text $\mathbf{d}$, our framework can infer a motion $x^{1}$ either independently with the generation module, or conditioned on other motions $x^{2}, x^{3}, \dots$ or on spatial guidance $\mathbf{s}$. The red line represents the implicit guidance of the spatial control signal.
  • Figure 3: Comparison with InterGen* on single- and two-person motion generation. For single-person motion, we generate it from our re-annotated single-person description. For two-person motion, we additionally use the original interactive descriptions. For better visualization, some pose frames are shifted to prevent complete overlap.
  • Figure 4: Qualitative results for generating three-person motions. We manually design text prompts and feed them to our network for motion generation. For better visualization, some pose frames are slightly shifted to prevent complete overlap.
  • Figure 5: Results of multi-person spatial control. We manually design some text prompts as well as the trajectories and leverage the integrated spatial control module to generate the results.