HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos

Tingting Han, Xinsong Tao, Yufei Yin, Min Tan, Sicheng Zhao, Zhou Yu

TL;DR

HERO (Hierarchical Embedding-Refinement for Open-Vocabulary grounding) is a unified framework that leverages hierarchical linguistic embeddings and performs parallel cross-modal refinement. It consistently surpasses state-of-the-art methods, particularly under open-vocabulary scenarios, validating its generalization capability and underscoring the significance of OV-TSGV as a new research direction.

Abstract

Temporal Sentence Grounding in Videos (TSGV) aims to temporally localize segments of a video that correspond to a given natural language query. Despite recent progress, most existing TSGV approaches operate under closed-vocabulary settings, limiting their ability to generalize to real-world queries involving novel or diverse linguistic expressions. To bridge this critical gap, we introduce the Open-Vocabulary TSGV (OV-TSGV) task and construct the first dedicated benchmarks--Charades-OV and ActivityNet-OV--that simulate realistic vocabulary shifts and paraphrastic variations. These benchmarks facilitate systematic evaluation of model generalization beyond seen training concepts. To tackle OV-TSGV, we propose HERO (Hierarchical Embedding-Refinement for Open-Vocabulary grounding), a unified framework that leverages hierarchical linguistic embeddings and performs parallel cross-modal refinement. HERO jointly models multi-level semantics and enhances video-language alignment via semantic-guided visual filtering and contrastive masked text refinement. Extensive experiments on both standard and open-vocabulary benchmarks demonstrate that HERO consistently surpasses state-of-the-art methods, particularly under open-vocabulary scenarios, validating its strong generalization capability and underscoring the significance of OV-TSGV as a new research direction.

Paper Structure

This paper contains 26 sections, 20 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: (a) Grounding visualization under open-vocabulary settings. (b, c) Performance comparison on standard benchmarks (Charades-STA and ActivityNet Captions) and the proposed open-vocabulary datasets (Charades-OV, ActivityNet-OV).
  • Figure 2: Distribution of sentences in the test set based on the number of words not present in the training vocabulary.
  • Figure 3: Distribution of the ten most frequent test-set words that did not appear in the training set. (a) test-ood in Charades-CD. (b) test-ov in Charades-OV. (c) test-ood in ActivityNet-CD. (d) test-ov in ActivityNet-OV. Red indicates that the word did not appear in the training set.
  • Figure 4: Framework overview of HERO. The Hierarchical Embedding Module (HEM) first extracts multi-level text representations from input queries. These hierarchical features are then processed in parallel by the Cross-modal Filtering and Refinement Engine (CFRE), where (1) Semantic-Guided Visual Filters suppress irrelevant video content, while (2) Contrastive Masked Text Refiners enhance linguistic robustness. Finally, the refined cross-modal features from each CFRE branch are fed into a Temporal Grounding Module to produce hierarchical predictions, which are aggregated via weighted summation for the final temporal localization result.
  • Figure 5: Performance comparison of state-of-the-art methods on different versions of the Charades dataset. The x-axis denotes different TSG models, and each line corresponds to a specific dataset variant. From left to right, the sub-figures report results for R1@0.1, R1@0.5, and R1@0.7, respectively.
  • ...and 2 more figures
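
The R1@m scores named in the Figure 5 caption are conventionally read as the fraction of queries whose top-1 predicted segment overlaps the ground-truth segment with a temporal IoU of at least m. The sketch below illustrates that standard definition only; the function names (`temporal_iou`, `recall_at_1`) and the toy segments are assumptions made for illustration, not the authors' evaluation code.

```python
from typing import Sequence, Tuple

# A segment is a (start_seconds, end_seconds) pair; this type is an assumption.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(
    preds: Sequence[Segment], gts: Sequence[Segment], threshold: float
) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    # Toy top-1 predictions and ground-truth segments (hypothetical values).
    preds = [(2.0, 8.0), (0.0, 5.0), (10.0, 12.0)]
    gts = [(4.0, 9.0), (3.0, 10.0), (10.0, 12.0)]
    for m in (0.1, 0.5, 0.7):
        print(f"R1@{m}: {recall_at_1(preds, gts, m):.3f}")
```

Stricter thresholds can only lower the score, so R1@0.7 is never higher than R1@0.5 for the same set of predictions.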