Correlation-Guided Query-Dependency Calibration for Video Temporal Grounding

WonJun Moon, Sangeek Hyun, SuBeen Lee, Jae-Pil Heo

TL;DR

This work tackles video temporal grounding by calibrating the degree of cross-modal interaction between text queries and video clips. It introduces CG-DETR, which uses adaptive cross-attention with dummy tokens to control text engagement, a clip-word correlation learner to infer fine-grained clip-word relations, and a moment-adaptive saliency detector to integrate context with calibrated interactions. The approach yields state-of-the-art results across multiple moment retrieval and highlight detection benchmarks, with comprehensive ablations validating each component and demonstrating robustness to pretraining. Overall, CG-DETR provides a principled framework for coarse-to-fine cross-modal understanding in video grounding, with potential implications for more interpretable and efficient multimodal transformers.

Abstract

Temporal grounding aims to identify specific moments or highlights in a video that correspond to textual descriptions. Typical approaches in temporal grounding treat all video clips equally during the encoding process, regardless of their semantic relevance to the text query. Therefore, we propose Correlation-Guided DEtection TRansformer (CG-DETR), which provides clues about query-associated video clips within the cross-modal attention. First, we design an adaptive cross-attention with dummy tokens. Dummy tokens conditioned on the text query take a portion of the attention weights, preventing irrelevant video clips from being represented by the text query. Yet, not all words equally inherit the text query's correlation to video clips. Thus, we further guide the cross-attention map by inferring the fine-grained correlation between video clips and words. We enable this by learning a joint embedding space for high-level concepts, i.e., the moment and sentence levels, and inferring the clip-word correlation from it. Lastly, we exploit moment-specific characteristics and combine them with the context of each video to form a moment-adaptive saliency detector. By exploiting the degree of text engagement in each video clip, it precisely measures the highlightness of each clip. CG-DETR achieves state-of-the-art results on various benchmarks for temporal grounding. Code is available at https://github.com/wjun0830/CGDETR.
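
The adaptive cross-attention described in the abstract can be illustrated with a minimal sketch. It assumes single-head scaled dot-product attention with no learned projections, clips acting as queries, and word plus dummy tokens acting as keys; only word tokens serve as values, so attention mass assigned to dummies is simply withheld from the clip. The function names and toy vectors are hypothetical and are not taken from the authors' code. The per-clip sum of word weights corresponds to the clip-wise correspondence score that the Figure 1 caption describes as the sum of attention weights over all words.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def adaptive_cross_attention(clips, words, dummies):
    """Attend from each clip to word tokens, letting dummy tokens absorb attention.

    clips, words, dummies: lists of d-dimensional vectors (lists of floats).
    Returns (outputs, correspondence): per-clip attended text features and the
    per-clip sum of attention weights placed on real words.
    """
    d = len(clips[0])
    keys = words + dummies  # dummies compete with words in the softmax
    outputs, correspondence = [], []
    for clip in clips:
        weights = softmax([dot(clip, key) / math.sqrt(d) for key in keys])
        word_weights = weights[:len(words)]
        # Only word tokens act as values (assumption of this sketch).
        out = [sum(w * word[j] for w, word in zip(word_weights, words))
               for j in range(d)]
        outputs.append(out)
        correspondence.append(sum(word_weights))
    return outputs, correspondence


# Toy example: clip 0 aligns with the word, clip 1 aligns with the dummy.
clips = [[4.0, 0.0], [0.0, 4.0]]
words = [[1.0, 0.0]]
dummies = [[0.0, 1.0]]
outputs, correspondence = adaptive_cross_attention(clips, words, dummies)
print(correspondence)  # clip 0 keeps most of its attention on the word; clip 1 does not
```

In this toy setting the query-relevant clip retains most of its attention on the word, while the irrelevant clip gives most of its attention to the dummy and therefore receives little text information.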

Paper Structure (35 sections, 14 equations, 15 figures, 14 tables)

Figures (15)

  • Figure 1: Comparison of degrees of text-to-video correlation in attention layers. In the middle column (b), we compare the clip-wise correspondence score to the text query (sum of attention weights over all words) with its corresponding GT (saliency scores). While (i) self-attention and (ii) cross-attention fail to distinguish target clips based on the degree of cross-modal attention, (iii) ours with adaptive cross-attention shows high activation only on query-relevant clips, since the dummies absorb a portion of the attention weight on irrelevant clips. We also investigate the fine-grained correlation between clips and words in column (c). Despite the absence of word-level supervision, ours learns to attend more to salient words.
  • Figure 2: CG-DETR overview. From left to right, the model consists of three phases: (i) feature extraction, (ii) correlation-guided feature interaction, and (iii) predictions for grounding tasks. (i) Along with video and text feature extraction, dummy tokens are conditioned by the query to represent the query-excluding meaning. (ii) Correlation-guided feature interaction is performed with adaptive cross-attention. In addition to calibrating the engagement of the text query as a whole, we also guide the word-wise engagement with the clip-word correlation learner. At the bottom, a saliency token $T$ is generated from video tokens and saliency candidates according to the values of the calibrated attention map. The saliency token is processed via a projector that shares its parameters with the query $Q$ projection layer in the adaptive cross-attention. Details of the correlation learner and the saliency token are in Fig. 4 and Fig. 5. (iii) Finally, tokens are processed through the encoder and decoder to make predictions.
  • Figure 3: Illustration of deriving clip-wise query correspondence $\bar{a}$.
  • Figure 4: Clip-word correlation learner to reflect the relevance between clips and text words into cross-attention. (a) We establish visual moment, non-moment, query, and dummy prototype tokens ($\hat{M}^{b+}, \hat{M}^{b-}, \hat{S}^{b+}, \text{and } \hat{S}^{b-}$ of the $b$-th instance within a batch) using learnable moment and sentence tokens. (b) To learn the aligned space, we use contrastive learning. Whereas the query prototype $\hat{S}^{+}$ learns to be aligned with the paired visual moment token $\hat{M}^{+}$, the dummy prototype $\hat{S}^{-}$ learns to exclude the moment-specific knowledge. (c) Given the moment-sentence aligned space from (b), we infer the clip-word-level correlation between each clip and the text words as well as the dummy tokens to form the guidance map $G$. Then, guidance is provided to the attention map in the cross-attention.
  • Figure 5: Saliency token generation. The saliency token is obtained by combining the video-averaged context token $V_{\text{ctx}}$ with a moment-descriptive token, which is calculated by aggregating the top-K moment-descriptive candidates. Specifically, we yield a moment-descriptive token by subtracting the context token from the clip tokens and then use their correlation to the saliency candidates in the pool $P$ as the moment-descriptiveness score of each candidate. Based on these scores, after scaling with the clip-wise query correspondence $\bar{a}$, we combine the top-K candidates to construct a moment-descriptive token. This results in a saliency token that not only maintains contextual similarity with the video tokens but also adeptly captures the characteristics of specific moments.
  • ...and 10 more figures
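
The saliency token generation outlined in the Figure 5 caption can be sketched roughly as follows. The caption does not specify how scores are aggregated over clips, how the top-K candidates are weighted, or how the two tokens are combined, so this sketch assumes a correspondence-weighted sum of dot products, softmax weighting over the selected candidates, and simple addition. All names (`saliency_token`, `pool`, `a_bar`) are hypothetical and do not come from the authors' code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def saliency_token(clips, pool, a_bar, k):
    """Build a saliency token from clip tokens and a pool of saliency candidates.

    clips: list of d-dim clip tokens; pool: list of d-dim candidate vectors;
    a_bar: per-clip query correspondence in [0, 1]; k: number of candidates kept.
    Returns (token, top_indices).
    """
    n, d = len(clips), len(clips[0])
    # Video-averaged context token V_ctx.
    v_ctx = [sum(c[j] for c in clips) / n for j in range(d)]
    # Remove the shared context so that clip-specific content remains.
    residuals = [[c[j] - v_ctx[j] for j in range(d)] for c in clips]
    # Candidate score: correlation with the residuals, scaled by a_bar (assumed form).
    scores = [sum(a * dot(r, cand) for a, r in zip(a_bar, residuals))
              for cand in pool]
    top = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax weighting of the selected candidates (assumption).
    weights = softmax([scores[i] for i in top])
    moment = [sum(w * pool[i][j] for w, i in zip(weights, top)) for j in range(d)]
    # Combine context and moment-descriptive tokens by addition (assumption).
    return [v_ctx[j] + moment[j] for j in range(d)], top


clips = [[1.0, 0.0], [0.0, 1.0]]
pool = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
a_bar = [0.9, 0.1]  # clip 0 is the query-relevant one
token, top = saliency_token(clips, pool, a_bar, k=2)
print(top, token)
```

Because the scores are scaled by $\bar{a}$, candidates that resemble the query-relevant clip dominate the selection, which is the behavior the caption attributes to the moment-descriptive token.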