SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding

Zixu Cheng, Yujiang Pu, Shaogang Gong, Parisa Kordjamshidi, Yu Kong

TL;DR

SHINE tackles compositional temporal grounding by generating semantically plausible hard negative queries with GPT-3.5-Turbo and applying a coarse-to-fine saliency ranking strategy to DETR-based video moment retrieval models. It constructs hierarchical negatives across verbs, nouns, adjectives, prepositions, and adverbs, and trains with two ranking losses, a coarse-grained loss $\mathcal{L}_{cr}$ and a fine-grained loss $\mathcal{L}_{fr}$, that together enforce multi-granularity video-text alignment. On Charades-CG and ActivityNet-CG, the method improves results on novel compositions and unseen words while maintaining performance on seen data, indicating better compositional generalization for DETR-based models. The approach offers a practical enhancement to existing temporal grounding pipelines and shows the value of LLM-generated hard negatives for learning nuanced semantic relationships.

Abstract

Temporal grounding, also known as video moment retrieval, aims to locate the video segments that correspond to a given query sentence. The compositional nature of natural language enables localization beyond predefined events, which challenges the compositional generalizability of existing methods. Recent studies establish the correspondence between videos and queries in a decompose-reconstruct manner to achieve compositional generalization. However, they consider only dominant primitives and build negative queries through random sampling and recombination, resulting in semantically implausible negatives that hinder the models from learning rational compositions. In addition, recent DETR-based methods still underperform in compositional temporal grounding, showing irrational saliency responses when given negative queries that differ only subtly from positive queries. To address these limitations, we first propose a large language model-driven method for negative query construction, utilizing GPT-3.5-Turbo to generate semantically plausible hard negative queries. Subsequently, we introduce a coarse-to-fine saliency ranking strategy, which encourages the model to learn the multi-granularity semantic relationships between videos and hierarchical negative queries to boost compositional generalization. Extensive experiments on two challenging benchmarks validate the effectiveness and generalizability of our proposed method. Our code is available at https://github.com/zxccade/SHINE.
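
To make the idea of per-primitive hard negatives concrete, the following is a minimal sketch, not the authors' pipeline. It assumes the query is already tagged with part-of-speech labels, and it uses a small hand-written lookup table (`SUBSTITUTES`, hypothetical) as a stand-in for the GPT-3.5-Turbo call that the paper uses to propose plausible replacements. The function name `build_hard_negatives` and the tag names are likewise illustrative.

```python
from typing import Dict, List, Tuple

# Hypothetical substitution table. In the paper, an LLM proposes plausible
# replacements; this dictionary only stands in for that step.
SUBSTITUTES: Dict[str, Dict[str, str]] = {
    "verb": {"opens": "closes"},
    "noun": {"door": "window"},
    "adjective": {"red": "blue"},
    "preposition": {"into": "out of"},
    "adverb": {"quickly": "slowly"},
}


def build_hard_negatives(tagged_query: List[Tuple[str, str]]) -> Dict[str, str]:
    """Return at most one negative query per primitive type.

    Each negative differs from the original query in exactly one word,
    so it stays fluent while changing the meaning.
    """
    negatives: Dict[str, str] = {}
    for i, (word, pos) in enumerate(tagged_query):
        replacement = SUBSTITUTES.get(pos, {}).get(word)
        # Skip words without a known substitute, and keep only the first
        # substitution found for each primitive type.
        if replacement is None or pos in negatives:
            continue
        words = [w for w, _ in tagged_query]
        words[i] = replacement
        negatives[pos] = " ".join(words)
    return negatives


query = [("person", "noun"), ("opens", "verb"), ("the", "det"), ("door", "noun")]
print(build_hard_negatives(query))
```

Swapping one primitive at a time keeps each negative close to the positive query, which is what makes it "hard": the model has to attend to the changed word rather than to overall sentence plausibility.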

Paper Structure

This paper contains 23 sections, 11 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Comparison of saliency scores given different queries. The existing work of Moon et al. (2023) struggles with discerning hard negative queries, showing irrational saliency responses under different primitive substitutions. Our method helps a model to learn the nuances in the semantics of hierarchical negative samples, suppressing the model's response to irrelevant queries while boosting its compositional generalizability.
  • Figure 2: The overall framework of our method SHINE. For each video-text pair, we first generate a set of hierarchical hard negative queries and randomly sample one negative query from the same mini-batch. These queries and the video clips are fed into a DETR-based encoder for interaction and predicting saliency scores $S$. The coarse-grained ranking loss $\mathcal{L}_{cr}$ aims to enlarge the disparity between the saliency scores produced by positive and negative queries, and the fine-grained ranking loss $\mathcal{L}_{fr}$ is designed to capture the nuanced semantics among the hierarchical hard negative queries. These two constraints are combined with $\mathcal{L}_{base}$ to optimize the model.
  • Figure 3: The construction pipeline of hierarchical hard negative queries.
  • Figure 4: An illustration of the Coarse-to-Fine Saliency Ranking strategy.
  • Figure 5: Hyperparameter Evaluation.
  • ...and 4 more figures
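
The Figure 2 caption describes two ranking constraints on saliency scores that are combined with $\mathcal{L}_{base}$. The sketch below illustrates that structure with generic hinge (margin) ranking terms; it is an assumption-laden illustration, not the paper's exact formulation. In particular, the margins, the loss weights `w_cr` and `w_fr`, the use of mean clip saliency per query, and the ordering of hard negatives from most to least relevant are all hypothetical choices made here for the example.

```python
from typing import List, Sequence


def mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


def hinge(margin: float, higher: float, lower: float) -> float:
    """Zero once `higher` exceeds `lower` by at least `margin`."""
    return max(0.0, margin - (higher - lower))


def coarse_ranking_loss(pos: Sequence[float],
                        negs: List[Sequence[float]],
                        margin: float = 0.2) -> float:
    """Push the positive query's saliency above every negative query's."""
    s_pos = mean(pos)
    return mean([hinge(margin, s_pos, mean(n)) for n in negs])


def fine_ranking_loss(ordered_negs: List[Sequence[float]],
                      margin: float = 0.1) -> float:
    """Keep adjacent hard negatives in order.

    `ordered_negs` is assumed to be sorted from most to least relevant
    to the video; this ordering is an assumption of the sketch.
    """
    levels = [mean(n) for n in ordered_negs]
    pairs = list(zip(levels, levels[1:]))
    if not pairs:
        return 0.0
    return mean([hinge(margin, hi, lo) for hi, lo in pairs])


def total_loss(base: float,
               pos: Sequence[float],
               ordered_hard_negs: List[Sequence[float]],
               random_neg: Sequence[float],
               w_cr: float = 1.0,
               w_fr: float = 1.0) -> float:
    """Combine a base loss with the two ranking terms (weights are hypothetical)."""
    all_negs = list(ordered_hard_negs) + [random_neg]
    l_cr = coarse_ranking_loss(pos, all_negs)
    l_fr = fine_ranking_loss(ordered_hard_negs)
    return base + w_cr * l_cr + w_fr * l_fr


# Per-clip saliency scores for one video under different queries.
loss = total_loss(
    base=1.0,
    pos=[0.9, 0.8],
    ordered_hard_negs=[[0.6, 0.6], [0.4, 0.4]],
    random_neg=[0.1, 0.1],
)
print(loss)
```

When the scores are already well separated, as in the usage example, both ranking terms vanish and only the base loss remains; the terms become positive only when a negative query's saliency comes too close to, or exceeds, that of a query that should rank above it.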