FineXtrol: Controllable Motion Generation via Fine-Grained Text

Keming Shen, Bizhu Wu, Junliang Chen, Xiaoqin Wang, Linlin Shen

TL;DR

FineXtrol addresses the challenge of controllable text-driven motion generation by introducing fine-grained, temporally explicit textual controls for body-part movements. It employs a dual-branch diffusion framework with residual guidance from fine-grained text and a hierarchical contrastive learning module to produce discriminative embeddings for these signals. Empirical results on HumanML3D show strong controllability across multiple body parts and temporal intervals, with improved efficiency and reduced parameter count compared to coordinate-based methods. The approach yields higher realism and precision in motion generation and offers a user-friendly alternative to spatial control signals, enabling scalable, fine-grained manipulation of human motions in practical applications.

Abstract

Recent works have sought to enhance the controllability and precision of text-driven motion generation. Some approaches leverage large language models (LLMs) to produce more detailed texts, while others incorporate global 3D coordinate sequences as additional control signals. However, the former often introduce misaligned details and lack explicit temporal cues, and the latter incur significant computational cost when converting coordinates to standard motion representations. To address these issues, we propose FineXtrol, a novel control framework for efficient motion generation guided by temporally aware, precise, user-friendly, and fine-grained textual control signals that describe specific body part movements over time. In support of this framework, we design a hierarchical contrastive learning module that encourages the text encoder to produce more discriminative embeddings for our novel control signals, thereby improving motion controllability. Quantitative results show that FineXtrol achieves strong performance in controllable motion generation, while qualitative analysis demonstrates its flexibility in directing specific body part movements.

Paper Structure

This paper contains 47 sections, 10 equations, 14 figures, 18 tables, 2 algorithms.

Figures (14)

  • Figure 1: Comparison of motion generation paradigms. (A) Vanilla text-to-motion methods struggle to control specific body part movements. (B) Fine-grained text-to-motion approaches use LLM-expanded descriptions to add detail, but these descriptions often misalign with ground-truth motions and lack explicit temporal cues. (C) Spatially controllable motion generation methods rely on global 3D coordinate sequences as extra control signals, which are difficult to obtain beyond existing datasets and incur high computational costs from pose conversion. (D) Our FineXtrol introduces accurate and temporally explicit fine-grained textual control signals for specific body parts, enabling user-friendly and efficient controllable motion generation.
  • Figure 2: Overview of FineXtrol. Our framework takes the coarse-grained text $\boldsymbol{p}$, the fine-grained textual control signal $\boldsymbol{c}$, and a noisy motion sequence $\boldsymbol{x_t}$ as input, and predicts the clean motion sequence $\boldsymbol{x_0}$. The lower branch is initialized from MDM to maintain stable motion generation capabilities from $\boldsymbol{p}$. The upper branch is a trainable copy of MDM, modulated by $\boldsymbol{c}$ through conditional feature adaptation. Zero-initialized linear layers connect the two branches. The framework ensures that the generated motion adheres not only to the coarse-grained text but also to the control signal.
  • Figure 3: A fine-grained textual control signal example.
  • Figure 4: Visualizations of different control settings. <Mask> is used for all unspecified temporal intervals.
  • Figure 5: Statistical results of the user study. The left pie chart shows the average preference ratio between our FineXtrol and CoMo for visualized motion sequences generated without fine-grained textual control signals (2 cases). The right pie chart shows the same ratio for sequences generated with fine-grained textual control signals (6 cases). Each case is evaluated on (1) alignment with control signals and (2) motion naturalness.
  • ...and 9 more figures
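
The two-branch design described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: each branch is reduced to a single linear layer, the lower branch is assumed to be frozen, and adding the control embedding to the input is a hypothetical placeholder for the paper's conditional feature adaptation. All names (`fused_block`, `zero_proj`, and so on) are invented for illustration. The sketch only shows why a zero-initialized connecting layer leaves the lower branch's output unchanged at the start of training.

```python
import copy
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector stored as a plain list."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def make_linear(dim, rng=None, zero_init=False):
    """Return (weight, bias), either randomly or zero-initialized."""
    if zero_init:
        return [[0.0] * dim for _ in range(dim)], [0.0] * dim
    weight = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]
    bias = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return weight, bias

def fused_block(x, control, lower, upper, zero_proj):
    """Combine the lower (assumed frozen) branch with a control residual."""
    base = linear(x, *lower)                          # lower branch output
    # Hypothetical stand-in for conditional feature adaptation:
    # the control embedding is simply added to the upper branch input.
    adapted = [a + c for a, c in zip(x, control)]
    ctrl = linear(adapted, *upper)                    # trainable-copy branch
    residual = linear(ctrl, *zero_proj)               # zero-initialized connector
    return [b + r for b, r in zip(base, residual)]

rng = random.Random(0)
dim = 4
lower = make_linear(dim, rng=rng)
upper = copy.deepcopy(lower)                          # trainable copy of the lower branch
zero_proj = make_linear(dim, zero_init=True)

x = [0.5, -1.0, 0.25, 2.0]                            # toy motion features
control = [1.0, 0.0, -1.0, 0.5]                       # toy control embedding

base = linear(x, *lower)
out = fused_block(x, control, lower, upper, zero_proj)
print(out == base)  # the control residual is exactly zero at initialization
```

Under these assumptions, the control signal starts to influence the output only once training moves the connecting layer's weights away from zero.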