Controllable Text-to-Motion Generation via Modular Body-Part Phase Control

Minyue Dai, Ke Fan, Anyi Rao, Jingbo Wang, Bo Dai

Abstract

Text-to-motion (T2M) generation is becoming a practical tool for animation and interactive avatars. However, modifying specific body parts while maintaining overall motion coherence remains challenging. Existing methods typically rely on cumbersome, high-dimensional joint constraints (e.g., trajectories), which hinder user-friendly, iterative refinement. To address this, we propose Modular Body-Part Phase Control, a plug-and-play framework enabling structured, localized editing via a compact, scalar-based phase interface. By modeling body-part latent motion channels as sinusoidal phase signals characterized by amplitude, frequency, phase shift, and offset, we extract interpretable codes that capture part-specific dynamics. A modular Phase ControlNet branch then injects this signal via residual feature modulation, seamlessly decoupling control from the generative backbone. Experiments on both diffusion- and flow-based models demonstrate that our approach provides predictable and fine-grained control over motion magnitude, speed, and timing. It preserves global motion coherence and offers a practical paradigm for controllable T2M generation. Project page: https://jixiii.github.io/bp-phase-project-page/
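The phase interface described in the abstract can be illustrated with a small sketch. It assumes the common sinusoidal form A · sin(2π(F·t − S)) + B for a single latent channel; the abstract names the four parameters but does not give the exact parameterization, so this form, the helper names `phase_signal` and `edit_params`, the 30 fps sampling rate, and the numeric values are all hypothetical.

```python
import math

def phase_signal(amplitude, frequency, shift, offset, times):
    """Evaluate A * sin(2*pi*(F*t - S)) + B at each time in `times` (seconds).

    The functional form is an assumption; the abstract only states that the
    signal is sinusoidal with amplitude, frequency, phase shift, and offset.
    """
    return [amplitude * math.sin(2.0 * math.pi * (frequency * t - shift)) + offset
            for t in times]

def edit_params(params, amp_scale=1.0, freq_scale=1.0, shift_delta=0.0):
    """Return a new (A, F, S, B) tuple after scalar edits; the offset is unchanged."""
    a, f, s, b = params
    # The shift is treated as a fraction of one cycle, so it wraps at 1.0.
    return (a * amp_scale, f * freq_scale, (s + shift_delta) % 1.0, b)

# Hypothetical parameters for one latent channel of one body part.
reference = (0.8, 1.5, 0.25, 0.1)
times = [i / 30.0 for i in range(60)]  # 2 seconds at an assumed 30 fps

# Scale only the amplitude: the curve stretches around the offset,
# while its pace and timing stay the same.
edited = edit_params(reference, amp_scale=1.5)
ref_curve = phase_signal(*reference, times)
edit_curve = phase_signal(*edited, times)
```

In this sketch each edit is a single scalar per body part, which is the kind of compact control the abstract contrasts with high-dimensional joint trajectories. How the edited parameters are encoded and injected into the generator is not covered here.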

Paper Structure (19 sections, 11 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Our method enables localized motion control via body-part phase. By editing the scalar phase parameters of a target body part, namely amplitude (A), frequency (F), and phase shift (S), we can directly modulate its motion magnitude, repetition pace, and temporal alignment in the generated sequence.
  • Figure 2: Overview of our modular body-part phase control framework. Given a reference motion, a frozen body-part phase extractor predicts the periodic parameters (AFSB) of each body part. A Phase ControlNet then injects multi-layer residuals into the backbone generator to produce motion latents aligned with these parameters. The generated latent is decoded by a frozen motion VAE decoder to obtain the final motion. Users can interactively edit the parameters via simple scalar controls, which are converted to a phase manifold and encoded into a control embedding.
  • Figure 3: Control-response correlation curves. Scale response curves for (a) amplitude and (b) frequency control. The x-axis represents the explicit control scale factor applied by the user, and the y-axis denotes the measured effective ratio ($X'/X$) of the generated motion relative to the reference. The solid blue line represents the global mean computed via hierarchical aggregation across all test cases, while the shaded region indicates the standard deviation. Within the typical editing range ($0.5 \le x \le 1.5$), the response is linear and closely proportional to the control scale. At extreme scales, the curves transition into a sub-linear regime with higher variance, reflecting the physical plausibility constraints of the generative prior.
  • Figure 4: Qualitative results of body-part motion editing. The middle row (d, e, f) displays the original motions generated from text prompts. The top and bottom rows show the results after applying our modular phase control. (a, d, g): Adjusting the shift ($S$) of the right arm to alter the timing of a scratching gesture. (b, e, h): Scaling the amplitude ($A$) of the right arm to control the magnitude of a waving action. (c, f, i): Scaling the frequency ($F$) of both legs to change the stepping pace of a walk. Our method precisely edits the target parts while keeping the rest of the body coherent.
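
The effective ratio $X'/X$ plotted in Figure 3 can be illustrated with a minimal sketch on one-dimensional signals. The estimators below (half peak-to-peak range for amplitude, upward mean crossings for frequency) are assumptions chosen for simplicity; the captions do not state how the ratio is actually measured, and the sinusoids are synthetic stand-ins rather than generated motions.

```python
import math

def effective_amplitude(signal):
    """Half the peak-to-peak range, a simple amplitude estimate."""
    return (max(signal) - min(signal)) / 2.0

def effective_frequency(signal, fps):
    """Estimate frequency in Hz from upward crossings of the signal mean."""
    mean = sum(signal) / len(signal)
    crossings = [i for i in range(1, len(signal))
                 if signal[i - 1] < mean <= signal[i]]
    if len(crossings) < 2:
        return 0.0  # fewer than one full cycle observed
    cycles = len(crossings) - 1
    duration = (crossings[-1] - crossings[0]) / fps
    return cycles / duration

def sinusoid(amplitude, frequency, n_frames, fps, shift=0.13):
    """Synthetic stand-in for one motion channel (not data from the paper)."""
    return [amplitude * math.sin(2.0 * math.pi * (frequency * i / fps - shift))
            for i in range(n_frames)]

FPS = 30          # assumed frame rate
N_FRAMES = 120    # 4 seconds

reference = sinusoid(1.0, 1.0, N_FRAMES, FPS)
amp_edited = sinusoid(1.5, 1.0, N_FRAMES, FPS)    # control scale 1.5 on amplitude
freq_edited = sinusoid(1.0, 1.5, N_FRAMES, FPS)   # control scale 1.5 on frequency

# Effective ratios X'/X: measured value of the edited signal over the reference.
amp_ratio = effective_amplitude(amp_edited) / effective_amplitude(reference)
freq_ratio = effective_frequency(freq_edited, FPS) / effective_frequency(reference, FPS)
```

For these ideal sinusoids the measured ratio matches the control scale. With a real generator, the figure reports that the response stays proportional only within the typical editing range and becomes sub-linear at extreme scales.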