MLP: Motion Label Prior for Temporal Sentence Localization in Untrimmed 3D Human Motions

Sheng Yan; Mengyuan Liu; Yong Wang; Yang Liu; Chen Chen; Hong Liu

TL;DR

This work tackles temporal sentence localization in untrimmed 3D motions (TSLM), a task hampered by low contextual richness and frame ambiguity. It introduces Motion Label Prior (MLP), a span-based model with a factorized Spatial+Temporal Encoder, cross-modal Fusion, and two label-prior modules: Label-Prior Sequence Matcher (LP-Matcher) and Label-Prior Span Predictor (LP-Predictor), to inject foreground/background priors and align predictions via recovery-based training. The authors construct a TSLM benchmark and demonstrate state-of-the-art performance on BABEL and HumanML3D (Restore), with strong results under high IoU thresholds and notable gains from the label-prior frameworks. They also show practical applicability to corpus-level moment retrieval by coupling MLP with a retrieval model, highlighting real-world search and animation implications for large 3D-motion corpora.

Abstract

In this paper, we address the unexplored question of temporal sentence localization in human motions (TSLM), which aims to locate the target moment in a 3D human motion that semantically corresponds to a text query. Because 3D human motions are captured with specialized motion capture devices, they consist of only a few joints and lack complex scene information such as objects and lighting. As a result, motion data has low contextual richness and semantic ambiguity between frames, which limits current video localization frameworks extended to TSLM to only rough predictions. To refine these predictions, we devise two novel label-prior-assisted training schemes: one embeds prior knowledge of foreground and background to highlight the localization chances of target moments, and the other forces the originally rough predictions to overlap with the more accurate predictions obtained from the flipped start/end prior label sequences during recovery training. We show that injecting label-prior knowledge into the model is crucial for improving performance at high IoU. On our constructed TSLM benchmark, our model, termed MLP, achieves a recall of 44.13 at IoU@0.7 on the BABEL dataset and 71.17 on HumanML3D (Restore), outperforming prior works. Finally, we showcase the potential of our approach in corpus-level moment retrieval. Our source code is openly accessible at https://github.com/eanson023/mlp.
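To make the "recall at IoU@0.7" metric concrete, the following is a minimal sketch of temporal IoU and thresholded recall. It assumes one predicted span per query and spans given as (start, end) pairs; the function names `temporal_iou` and `recall_at_iou` are illustrative and are not taken from the authors' code, and the paper's exact evaluation protocol may differ.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two temporal spans given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds, gts, threshold=0.7):
    """Percentage of queries whose predicted span reaches the IoU threshold.

    Assumption: exactly one prediction per query (top-1 recall).
    """
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    preds = [(0.0, 10.0), (2.0, 8.0)]
    gts = [(0.0, 9.0), (5.0, 15.0)]
    print(recall_at_iou(preds, gts, threshold=0.7))
```

A high threshold such as 0.7 only counts predictions whose boundaries are nearly exact, which is why the abstract treats it as the regime where label-prior knowledge matters most.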

Paper Structure (18 sections, 13 equations, 9 figures, 8 tables)

Introduction
Related Work
Method
Task Definition
MLP Model Architecture
Motion Encoder
Cross-modal Fusion
Label-Prior Sequence Matcher (LP-Matcher)
Label-Prior Span Predictor (LP-Predictor)
Training Strategy
Experiments
Dataset and Evaluation Protocol
Implementation Details
A new benchmark & Comparison to prior work
Ablation Study & In-depth Analysis
...and 3 more sections

Figures (9)

Figure 1: (a) An illustration of the TSLM task. (b) Standard span-based TSLV framework \cite{zhang2020span}, designed to predict the probability $\mathcal{P}_{s/e}$ for each motion frame as the starting/ending position of the target moment. (c) Our proposed MLP injects label-prior knowledge into the two components of (b), corresponding to the Label-Prior Sequence Matcher (LP-Matcher) and the Label-Prior Span Predictor (LP-Predictor).
Figure 2: Performance comparison between the span-based TSLV method and ours at high intersection over union (IoU) thresholds on the BABEL dataset. (X-axis: IoU threshold; Y-axis: number of located samples)
Figure 3: Our proposed MLP: (a) For each $(\mathbf{M}, \mathbf{Q})$ pair, we build a factorised Spatial+Temporal Encoder to extract motion features $\mathbf{\bar{M}}$. The Temporal Encoder (T-Enc) is then shared with the text modality to obtain text features $\mathbf{\bar{Q}}$. (b) Following Cross-modal Fusion, our Label-Prior Sequence Matcher embeds prior foreground-background knowledge ($\mathcal{E}_{\text{emb}}$) into the query-attended features $\mathbf{\bar{M}^{q}}$, which helps optimize $\mathcal{L}_{\text{Seq}}$ by indicating in advance which frames are foreground and which are background. (c) In the Label-Prior Span Predictor, we design a Predicting part and a Recovering part in parallel. The optimization goal is the same for both parts; however, the Recovering part uses the label-flipped prior knowledge (composed of start/non-start or end/non-end label embeddings, $\mathcal{\bar{E}}_{\text{s/e}}$) to perform recovery training.
Figure 4: Qualitative grounding results on the BABEL test set. We visualize the start/end probabilities $\mathcal{P}_{\text{s/e}}$ (Eq. \ref{eq:infer}) of MLPBase and MLP, and the highlight score $\mathcal{S}_{\text{LP/SM}}$ (Sec. \ref{subsubsec:lpmatcher}), on the right. It can be observed that in the foreground region (yellow background), the $\mathcal{S}_{\text{LP}}$ of MLP is generally higher than the $\mathcal{S}_{\text{SM}}$ of MLPBase.
Figure 5: Qualitative grounding results on the HumanML3D (Restore) test set. We visualize the probability distributions of the two parts in the LP-Predictor (Sec. \ref{subsubsec:lppredictor}). Please focus on the red rectangle on the right. It can be observed that, compared to $\mathcal{P}_{\text{s}}$ (MLPBase), $\mathcal{P}_{\text{s}}$ (MLP) is closer to $\mathcal{P}_{\text{s}}^{\text{rec}}$. This indicates that the predictions of MLP, which incorporates prior knowledge, are more accurate.
...and 4 more figures
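
The span-based setup in Figure 1(b) produces per-frame start and end probabilities $\mathcal{P}_{s/e}$. One common way to turn such probabilities into a single moment is to pick the frame pair that maximizes the product of start and end probability subject to start ≤ end. The sketch below illustrates that decoding step only; it is an assumption about inference, not the authors' implementation, and `decode_span` is a hypothetical name.

```python
def decode_span(p_start, p_end):
    """Return ((start, end), score) maximizing p_start[s] * p_end[e] with s <= e.

    Assumption: p_start and p_end are equal-length lists of per-frame
    probabilities. Runs in O(T) by tracking the best start seen so far.
    """
    best_span, best_score = (0, 0), -1.0
    best_s = 0  # index of the highest start probability among frames 0..e
    for e in range(len(p_end)):
        if p_start[e] > p_start[best_s]:
            best_s = e
        score = p_start[best_s] * p_end[e]
        if score > best_score:
            best_score = score
            best_span = (best_s, e)
    return best_span, best_score


if __name__ == "__main__":
    p_start = [0.1, 0.6, 0.2, 0.1]
    p_end = [0.5, 0.1, 0.1, 0.3]
    span, score = decode_span(p_start, p_end)
    print(span, score)
```

Taking the two argmaxes independently would pair start frame 1 with end frame 0 in this example, which is not a valid moment; the ordering constraint rules that out.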
