Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training

Yunjiao Zhou, Xinyan Chen, Junlang Qian, Lihua Xie, Jianfei Yang

TL;DR

ZOMG tackles zero-shot, open-vocabulary motion grounding without segment annotations. It introduces a two-stage framework: Motion-Language Pretraining first captures global motion-language semantics, and per-instance Test-Time Grounding, built on Language Semantic Partition and Soft Masking Optimization, then produces fine-grained, temporally grounded sub-actions. The method uses LLM-based decomposition to obtain semantic anchors and learnable framewise masks to localize segments. The masks are optimized with a combined objective that includes intra-sequence contrastive terms and regularizers for mask exclusivity and smoothness. On three motion-language benchmarks, ZOMG achieves state-of-the-art grounding (e.g., up to +8.7% mAP on HumanML3D) and improves downstream motion-text retrieval. It remains efficient, with ~0.5K trainable parameters and substantial inference speedups, which makes annotation-free deployment and scalable interpretation of complex actions practical.

Abstract

Understanding complex human activities demands the ability to decompose motion into fine-grained, semantically aligned sub-actions. This motion grounding process is crucial for behavior analysis, embodied AI, and virtual reality. Yet most existing methods rely on dense supervision with predefined action classes, which is infeasible in open-vocabulary, real-world settings. In this paper, we propose ZOMG, a zero-shot, open-vocabulary framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. Technically, ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization, which learns instance-specific temporal masks that focus on the frames critical to each sub-action while maintaining intra-segment continuity and enforcing inter-segment separation, all without altering the pretrained encoder. Experiments on three motion-language datasets demonstrate state-of-the-art motion grounding effectiveness and efficiency, outperforming prior methods by +8.7% mAP on the HumanML3D benchmark. ZOMG also yields significant improvements in downstream retrieval, establishing a new paradigm for annotation-free motion understanding.
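
To make the ideas of intra-segment continuity and inter-segment separation more concrete, here is a minimal sketch of how soft framewise masks and two such regularizers could be computed. It is an illustration only, not the paper's objective: the function names (`soft_masks`, `smoothness_penalty`, `exclusivity_penalty`), the softmax normalization across segments, and the exact penalty forms are all assumptions made for this example, and the contrastive term and the pretrained encoder are omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def soft_masks(logits):
    """Turn learnable logits[k][t] into masks[k][t].

    Assumption: masks are normalized across the K sub-actions at each
    frame t, so every frame distributes a total weight of 1.
    """
    K, T = len(logits), len(logits[0])
    cols = [softmax([logits[k][t] for k in range(K)]) for t in range(T)]
    return [[cols[t][k] for t in range(T)] for k in range(K)]

def smoothness_penalty(masks):
    """Mean squared difference between adjacent frames of each mask.

    Small when each mask changes rarely over time (contiguous segments).
    """
    total, n = 0.0, 0
    for row in masks:
        for a, b in zip(row, row[1:]):
            total += (a - b) ** 2
            n += 1
    return total / n

def exclusivity_penalty(masks):
    """Mean per-frame overlap between every pair of masks.

    Near zero when no frame is claimed strongly by two sub-actions.
    """
    K, T = len(masks), len(masks[0])
    total, pairs = 0.0, 0
    for i in range(K):
        for j in range(i + 1, K):
            total += sum(a * b for a, b in zip(masks[i], masks[j])) / T
            pairs += 1
    return total / pairs

if __name__ == "__main__":
    # Two sub-actions over 8 frames (hypothetical toy logits).
    contiguous = soft_masks([[5] * 4 + [-5] * 4, [-5] * 4 + [5] * 4])
    alternating = soft_masks([[5, -5] * 4, [-5, 5] * 4])
    undecided = soft_masks([[0] * 8, [0] * 8])
    print("smoothness:", smoothness_penalty(contiguous),
          smoothness_penalty(alternating))
    print("exclusivity:", exclusivity_penalty(contiguous),
          exclusivity_penalty(undecided))
```

In this toy setting, masks that split the sequence into two contiguous blocks incur a lower smoothness penalty than masks that alternate frame by frame, and a lower exclusivity penalty than masks that leave every frame undecided between the two sub-actions. In a test-time setup of the kind the abstract describes, only the mask logits would be updated while the encoder stays frozen.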

Paper Structure

This paper contains 27 sections, 10 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Motion grounding illustration of ZOMG.
  • Figure 2: Overall framework of ZOMG.
  • Figure 3: Motion grounding comparison in HumanML3D.
  • Figure 4: Analysis of ZOMG: (top) mask heatmap and (bottom) t-SNE distribution.
  • Figure 5: Boxplot comparison of semantic similarity for HumanML3D grounded motion-text pairs.
  • ...and 1 more figure