Text-controlled Motion Mamba: Text-Instructed Temporal Grounding of Human Motion

Xinghan Wang; Zixi Kang; Yadong Mu

TL;DR

This work introduces text-based human motion grounding (THMG), aiming to localize temporal segments in untrimmed 3D motion sequences using natural language queries. It presents TM-Mamba, a unified, linear-memory model that fuses global temporal context, text-conditioned control, and skeletal topology via relational embeddings to ground motion segments efficiently. To evaluate THMG, the authors construct BABEL-Grounding, a dataset with detailed textual descriptions and ground-truth temporal segments, and demonstrate state-of-the-art grounding performance with extensive ablations and baselines. The approach offers scalable, accurate grounding suitable for long sequences and real-world applications requiring text-guided temporal localization in motion data.

Abstract

Human motion understanding is a fundamental task with diverse practical applications, facilitated by the availability of large-scale motion capture datasets. Recent studies focus on text-motion tasks, such as text-based motion generation, editing and question answering. In this study, we introduce the novel task of text-based human motion grounding (THMG), aimed at precisely localizing temporal segments corresponding to given textual descriptions within untrimmed motion sequences. Capturing global temporal information is crucial for the THMG task. However, Transformer-based models that rely on global temporal self-attention face challenges when handling long untrimmed sequences due to the quadratic computational cost. We address these challenges by proposing Text-controlled Motion Mamba (TM-Mamba), a unified model that integrates temporal global context, language query control, and spatial graph topology with only linear memory cost. The core of the model is a text-controlled selection mechanism which dynamically incorporates global temporal information based on the text query. The model is further enhanced to be topology-aware through the integration of relational embeddings. For evaluation, we introduce BABEL-Grounding, the first text-motion dataset that provides detailed textual descriptions of human actions along with their corresponding temporal segments. Extensive evaluations demonstrate the effectiveness of TM-Mamba on BABEL-Grounding.

Paper Structure

This paper contains 29 sections, 1 theorem, 5 equations, 12 figures, 5 tables, and 2 algorithms.

Introduction
Related work
Datasets for Text-Motion Learning
Text-Motion Multi-modal Learning
State Space Model
BABEL-Grounding Dataset
Textual augmentation
Utilizing external annotations
Template-based augmentation
Temporal Augmentation
Time windows merging
One-to-many mapping
Annotation Quality
Automatic Evaluation
Manual Evaluation
...and 14 more sections

Key Result

Lemma 1

When $N = 1$, $\mathbf{A} = -1$, and $\mathbf{B} = 1$, the text-controlled selection mechanism takes the form $g_t = \sigma(\mathrm{Linear}_{\Delta}(X, q))$ and $h_t = (1 - g_t) h_{t-1} + g_t x_t$, where $X$ denotes the input sequence, $x_t$ its element at step $t$, and $q$ the query embedding.
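The lemma can be checked numerically with a small sketch. It reads the input term of the recurrence as the frame feature $x_t$, and it assumes two things the lemma statement above does not spell out: a zero-order-hold discretization and a softplus step size, as in the original Mamba formulation. The function `gate_logit` is a hypothetical stand-in for $\mathrm{Linear}_{\Delta}$ (a scalar affine map of one frame feature and the query embedding), not the authors' implementation.

```python
import math
import random


def softplus(z):
    return math.log1p(math.exp(z))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate_logit(x_t, q, w_x, w_q, bias):
    """Hypothetical Linear_Delta: affine map of one frame feature and the query."""
    return w_x * x_t + sum(w * q_i for w, q_i in zip(w_q, q)) + bias


def gated_scan(xs, logits):
    """Lemma form: h_t = (1 - g_t) * h_{t-1} + g_t * x_t, g_t = sigmoid(logit_t)."""
    h, out = 0.0, []
    for x_t, z_t in zip(xs, logits):
        g_t = sigmoid(z_t)
        h = (1.0 - g_t) * h + g_t * x_t
        out.append(h)
    return out


def ssm_scan(xs, logits, A=-1.0, B=1.0):
    """Scalar (N = 1) SSM, zero-order-hold discretized with Delta_t = softplus(logit_t)."""
    h, out = 0.0, []
    for x_t, z_t in zip(xs, logits):
        delta = softplus(z_t)
        a_bar = math.exp(delta * A)
        # ZOH input term: (Delta*A)^-1 (exp(Delta*A) - 1) * Delta*B
        b_bar = (a_bar - 1.0) / A * B
        h = a_bar * h + b_bar * x_t
        out.append(h)
    return out


random.seed(0)
T = 50
xs = [random.gauss(0.0, 1.0) for _ in range(T)]      # one feature channel over T frames
q = [random.gauss(0.0, 1.0) for _ in range(4)]       # toy query embedding
logits = [gate_logit(x, q, 0.5, [0.3, -0.2, 0.1, 0.4], 0.0) for x in xs]

h_gate = gated_scan(xs, logits)
h_ssm = ssm_scan(xs, logits)
max_gap = max(abs(a - b) for a, b in zip(h_gate, h_ssm))
```

The two scans agree because $\exp(-\mathrm{softplus}(z)) = 1 - \sigma(z)$, so the discretized state and input coefficients become $1 - g_t$ and $g_t$. Each step keeps only the current state, so the recurrence runs in time linear in the sequence length, and the query influences every step through the gate.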

Figures (12)

Figure 1: Illustration of the Text-based Human Motion Grounding (THMG) task and samples of the proposed BABEL-Grounding dataset. Best viewed in color.
Figure 2: Dataset statistics of BABEL-Grounding. 'Frame Number' refers to the length of motion sequences. 'Text Query Length' denotes the length of textual annotations in the data. 'Grounded Length Ratio' indicates the ratio of the length of temporal segments corresponding to each text query to the total length of the sequence. 'Segment Counts per Query' refers to the number of temporal segments corresponding to each text query.
Figure 3: An illustration of the data augmentation pipeline, highlighting the differences between the original BABEL annotations and the BABEL-Grounding annotations.
Figure 4: Evaluation results of annotation quality. $d = \frac{l}{3}$ measures the shifting offset relative to the time window length $l$. BABEL-Grounding achieves comparable matching accuracy to BABEL at zero offset, while demonstrating greater sensitivity to temporal shifts due to its richer information in text annotations.
Figure 5: Results of dataset evaluation. The left side presents the frequency of scores ranging from 1 to 5, while the right side presents the cumulative percentages.
...and 7 more figures
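
The per-query statistics named in the Figure 2 caption can be illustrated with a short sketch. It assumes each query is annotated with a list of half-open `(start, end)` frame windows and that overlapping windows are merged before counting; both the representation and the helper names `merge_windows` and `query_stats` are illustrative assumptions, not taken from the dataset's tooling.

```python
def merge_windows(segments):
    """Merge overlapping or touching (start, end) frame windows."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Extend the previous window instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def query_stats(segments, num_frames):
    """Segment count and grounded length ratio for one text query."""
    merged = merge_windows(segments)
    grounded = sum(end - start for start, end in merged)
    return {
        "segment_count": len(merged),
        "grounded_length_ratio": grounded / num_frames,
    }


# Toy example: three annotated windows in a 200-frame sequence.
stats = query_stats([(10, 40), (30, 60), (120, 150)], num_frames=200)
```

In this toy case the first two windows overlap and collapse into one, so the query maps to two segments covering 80 of the 200 frames.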

Theorems & Definitions (1)

Lemma 1
