Multi-Scale Contrastive Learning for Video Temporal Grounding

Thong Thanh Nguyen, Yi Bin, Xiaobao Wu, Zhiyuan Hu, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu

TL;DR

The paper tackles temporal grounding across videos of varying lengths, where multi-scale feature pyramids suffer information degradation as moments lengthen. It introduces a multi-scale contrastive learning framework that draws positive and negative moment samples directly from encoder representations, using within-scale and cross-scale losses with query-centric sampling to relate moments across layers and scales without external augmentation or memory banks. The approach combines $\text{L}_{cls}$, $\text{L}_{reg}$, $\text{L}_{within}$, and $\text{L}_{cross}$ with balancing factors to train a unified model, achieving state-of-the-art performance on both long-form and short-form video grounding benchmarks. This yields robust moment representations across scales, enabling accurate localization in diverse real-world video settings and enhancing practical vision-language grounding applications.
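
One plausible reading of the combined objective is a weighted sum of the four losses. The weights $\lambda_{reg}$, $\lambda_{within}$, and $\lambda_{cross}$ below are placeholder names assumed for illustration; this summary does not state which terms carry a balancing factor or what values are used.

$$
\text{L} = \text{L}_{cls} + \lambda_{reg}\,\text{L}_{reg} + \lambda_{within}\,\text{L}_{within} + \lambda_{cross}\,\text{L}_{cross}
$$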

Abstract

Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a multi-level structure known as a feature pyramid. In this structure, lower levels concentrate on short-range video moments, while higher levels address long-range moments. Because higher levels are downsampled to accommodate increasing moment length, their capacity to capture information is reduced, which degrades the information in their moment representations. To resolve this problem, we propose a contrastive learning framework to capture salient semantics among video moments. Our key idea is to draw samples from the feature space of multiple stages of the video encoder itself, so that obtaining positive and negative samples requires neither data augmentation nor online memory banks. To enable this, we introduce a sampling process that draws multiple video moments corresponding to a common query. Subsequently, by utilizing these moments' representations across video encoder layers, we instantiate a novel form of multi-scale and cross-scale contrastive learning that links local short-range video moments with global long-range video moments. Extensive experiments demonstrate the effectiveness of our framework for not only long-form but also short-form video grounding.
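
The idea of contrasting moments across pyramid levels can be illustrated with a toy sketch. It is not the authors' implementation. Average pooling for downsampling, cosine similarity, an InfoNCE-style loss, the temperature value, and all function names are assumptions chosen for illustration. The sketch builds a three-level pyramid from eight clip features, then treats a long-range feature at the top level as the anchor and the short-range features for the same query at the bottom level as positives.

```python
import math

def avg_pool(features, stride=2):
    """Downsample a sequence of feature vectors by averaging
    non-overlapping windows of `stride` vectors (assumed operator)."""
    pooled = []
    for start in range(0, len(features) - stride + 1, stride):
        window = features[start:start + stride]
        pooled.append([sum(col) / stride for col in zip(*window)])
    return pooled

def build_pyramid(features, num_levels):
    """Level 0 keeps full temporal resolution; each higher level halves it."""
    levels = [features]
    for _ in range(num_levels - 1):
        levels.append(avg_pool(levels[-1]))
    return levels

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def info_nce(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss, averaged over the positives (assumed form)."""
    neg_sum = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    total = 0.0
    for p in positives:
        pos = math.exp(cosine(anchor, p) / temperature)
        total += -math.log(pos / (pos + neg_sum))
    return total / len(positives)

# Toy video: clips 0-3 match the query, clips 4-7 do not (hypothetical data).
clips = [[1.0, 0.1]] * 4 + [[0.1, 1.0]] * 4
levels = build_pyramid(clips, num_levels=3)

# Cross-scale pairing: the top-level feature covering clips 0-3 is the anchor.
anchor = levels[2][0]
aligned = info_nce(anchor, positives=levels[0][:4], negatives=levels[0][4:])
swapped = info_nce(anchor, positives=levels[0][4:], negatives=levels[0][:4])
```

With these toy features, `aligned` comes out far smaller than `swapped`, because the anchor and its positives summarize the same query-relevant clips. A within-scale variant would draw the anchor, positives, and negatives from the same level instead.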

Paper Structure

This paper contains 25 sections, 8 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: (Left) Illustration of a feature pyramid that encodes video moments of different lengths; (Right) An example where the recent method SnAG [mu2024snag] accurately localizes a short video moment but fails on a long moment.
  • Figure 2: First and second panels: IoU results with respect to target video moment length on Ego4D-NLQ [grauman2022ego4d] for the SnAG baseline [mu2024snag] and our model. Third and fourth panels: the same comparison on TACoS [regneri2013grounding].
  • Figure 3: Overall illustration of the proposed framework.