
CVA: Context-aware Video-text Alignment for Video Temporal Grounding

Sungho Moon, Seunghun Lee, Jiwan Seo, Sunghoon Im

Abstract

We propose Context-aware Video-text Alignment (CVA), a novel framework to address a significant challenge in video temporal grounding: achieving temporally sensitive video-text alignment that remains robust to irrelevant background context. Our framework is built on three key components. First, we propose Query-aware Context Diversification (QCD), a new data augmentation strategy that ensures only semantically unrelated content is mixed in. It builds a video-text similarity-based pool of replacement clips to simulate diverse contexts while preventing the false negatives caused by query-agnostic mixing. Second, we introduce the Context-invariant Boundary Discrimination (CBD) loss, a contrastive loss that enforces semantic consistency at challenging temporal boundaries, making their representations robust to contextual shifts and hard negatives. Third, we introduce the Context-enhanced Transformer Encoder (CTE), a hierarchical architecture that combines windowed self-attention and bidirectional cross-attention with learnable queries to capture multi-scale temporal context. Through the synergy of these data-centric and architectural enhancements, CVA achieves state-of-the-art performance on major VTG benchmarks, including QVHighlights and Charades-STA. Notably, our method achieves a significant improvement of approximately 5 points in Recall@1 (R1) scores over state-of-the-art methods, highlighting its effectiveness in mitigating false negatives.
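The core idea behind QCD, as described above and in Figure A1, is to restrict replacement clips to an intermediate band of video-text similarity: high-similarity clips contain query-relevant actions and would introduce false-negative supervision, while the band keeps the mixed-in context genuinely unrelated. A minimal sketch of such band-based pool construction is shown below; the function name, feature shapes, and threshold values are illustrative assumptions, not the paper's actual implementation or hyperparameters.

```python
import numpy as np

def build_replacement_pool(clip_feats, text_feat, low=0.2, high=0.9):
    """Select candidate replacement clips whose cosine similarity to the
    query text falls in an intermediate band (illustrative thresholds)."""
    # L2-normalize clip and text embeddings for cosine similarity
    clips = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    text = text_feat / np.linalg.norm(text_feat)
    sims = clips @ text
    # Keep only the intermediate band: clips above `high` risk being
    # semantically relevant to the query (false negatives); the exact
    # band would be tuned from the similarity distribution (cf. Fig. A1).
    mask = (sims >= low) & (sims < high)
    return np.where(mask)[0]
```

In practice, the selected indices would then be used to swap background clips into the source video during augmentation, leaving the ground-truth moment untouched.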

Paper Structure

This paper contains 30 sections, 18 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Problem of previous context diversification [zhou2025devil].
  • Figure 2: Illustration of our Query-aware Context Diversification.
  • Figure 3: Overview of our CVA. (a) The overall pipeline, which includes the Context-enhanced Transformer Encoder and the Context-invariant Boundary Discrimination module. (b) The detailed architecture of the Context-enhanced Transformer Encoder.
  • Figure A1: Density distribution of cosine similarity across all clip-text pairs in the QVHighlights dataset. The clustered structure motivates selecting an intermediate similarity band for QCD.
  • Figure A2: Examples illustrating the risk of using high-similarity clips as negative replacements. Such clips contain semantically relevant actions, and including them introduces false-negative supervision.
  • ...and 2 more figures