MLP: Motion Label Prior for Temporal Sentence Localization in Untrimmed 3D Human Motions
Sheng Yan, Mengyuan Liu, Yong Wang, Yang Liu, Chen Chen, Hong Liu
TL;DR
This work tackles temporal sentence localization in untrimmed 3D motions (TSLM), a task hampered by low contextual richness and frame ambiguity. It introduces Motion Label Prior (MLP), a span-based model with a factorized Spatial+Temporal Encoder, cross-modal Fusion, and two label-prior modules, the Label-Prior Sequence Matcher (LP-Matcher) and the Label-Prior Span Predictor (LP-Predictor), which inject foreground/background priors and align predictions via recovery-based training. The authors construct a TSLM benchmark and demonstrate state-of-the-art performance on BABEL and HumanML3D (Restore), with strong results under high IoU thresholds and notable gains from the label-prior modules. They also show practical applicability to corpus-level moment retrieval by coupling MLP with a retrieval model, highlighting implications for real-world search and animation over large 3D-motion corpora.
Abstract
In this paper, we address the unexplored problem of temporal sentence localization in human motions (TSLM), which aims to locate the target moment in a 3D human motion that semantically corresponds to a text query. Because 3D human motions are captured with specialized motion capture devices, they contain only a few joints and lack complex scene information such as objects and lighting. As a result, motion data has low contextual richness and is semantically ambiguous between frames, which limits current video localization frameworks, when extended to TSLM, to only rough predictions. To refine these predictions, we devise two novel label-prior-assisted training schemes: one embeds prior knowledge of foreground and background to highlight the localization chances of target moments, and the other forces the originally rough predictions to overlap with the more accurate predictions obtained from the flipped start/end prior label sequences during recovery training. We show that injecting label-prior knowledge into the model is crucial for improving performance at high IoU. On our constructed TSLM benchmark, our model, termed MLP, achieves a recall of 44.13 at IoU@0.7 on the BABEL dataset and 71.17 on HumanML3D (Restore), outperforming prior works. Finally, we showcase the potential of our approach in corpus-level moment retrieval. Our source code is openly accessible at https://github.com/eanson023/mlp.
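The recall figures at IoU@0.7 quoted in the abstract depend on the temporal intersection-over-union between a predicted span and the annotated span. The Python sketch below illustrates how such a metric is commonly computed. It is not taken from the authors' code: the function names `temporal_iou` and `recall_at_iou`, the (start, end) span representation, and the assumption of a single top-1 prediction per query are all illustrative choices.

```python
from typing import Sequence, Tuple

# A span is (start, end) in frames or seconds, with start <= end (assumed).
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two 1D temporal spans."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(
    preds: Sequence[Span], gts: Sequence[Span], threshold: float = 0.7
) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    # Hypothetical toy spans, not data from BABEL or HumanML3D.
    predictions = [(0.0, 10.0), (2.0, 8.0)]
    ground_truth = [(0.0, 9.0), (5.0, 15.0)]
    print(recall_at_iou(predictions, ground_truth, threshold=0.7))
```

A higher threshold such as 0.7 counts only predictions whose boundaries closely match the annotation, which is why the abstract treats performance at high IoU as the measure of fine-grained localization quality.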
