Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization

Zhuo Tao, Liang Li, Qi Chen, Yunbin Tu, Zheng-Jun Zha, Ming-Hsuan Yang, Yuankai Qi, Qingming Huang

TL;DR

This work tackles natural language video localization under point supervision, where only a single annotated frame within the target moment is available. It introduces COTEL, a framework that jointly learns frame-level saliency and segment-level moment proposals via Temporal Consistency Learning (TCL), and enforces mutual enhancement through cross-consistency guidance (Frame-Level and Segment-Level) and a Hierarchical Contrastive Alignment Loss (HCAL). By integrating a Gaussian prior around the annotated frame and using both intra- and inter-video contrastive terms, COTEL achieves strong video–text alignment and precise moment localization, approaching fully-supervised performance at a fraction of the annotation cost. Experiments on Charades-STA and TACoS demonstrate state-of-the-art results among point-supervised methods, with ablations confirming the complementary roles of frame- and segment-level paths and the effectiveness of cross-consistency and HCAL. The approach offers practical impact for scalable video understanding where exhaustive temporal annotations are impractical.
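The Gaussian prior mentioned above can be pictured as a bell-shaped weighting over frames that peaks at the single annotated frame and decays with temporal distance. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `gaussian_prior`, the unnormalized form, and the width `sigma` are all hypothetical choices made for illustration.

```python
import math


def gaussian_prior(num_frames: int, point_idx: int, sigma: float) -> list[float]:
    """Return one weight per frame, peaking at 1.0 on the annotated frame.

    Assumption: an unnormalized Gaussian exp(-(t - p)^2 / (2 * sigma^2));
    the source does not give the exact form or the width used by COTEL.
    """
    if not 0 <= point_idx < num_frames:
        raise ValueError("point_idx must lie inside the video")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return [
        math.exp(-((t - point_idx) ** 2) / (2.0 * sigma ** 2))
        for t in range(num_frames)
    ]


# Hypothetical 9-frame clip whose annotated point is frame 4.
weights = gaussian_prior(num_frames=9, point_idx=4, sigma=1.5)
```

Frames near the annotation receive weights close to 1 and distant frames receive weights close to 0, so such a prior could serve as a soft pseudo-label for frame saliency when no boundaries are annotated.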

Abstract

Natural language video localization (NLVL) is a crucial task in video understanding that aims to localize the target moment in a video specified by a given language description. Recently, a point-supervised paradigm has been presented to address this task, requiring only a single annotated frame within the target moment rather than complete temporal boundaries. Compared with the fully-supervised paradigm, it offers a balance between localization accuracy and annotation cost. However, due to the absence of complete annotation, it is challenging to align the video content with language descriptions, which hinders accurate moment prediction. To address this problem, we propose a new COllaborative Temporal consistEncy Learning (COTEL) framework that leverages the synergy between saliency detection and moment localization to strengthen the video-language alignment. Specifically, we first design frame-level and segment-level Temporal Consistency Learning (TCL) modules that model semantic alignment across frame saliencies and sentence-moment pairs. Then, we design a cross-consistency guidance scheme, including a Frame-level Consistency Guidance (FCG) and a Segment-level Consistency Guidance (SCG), that enables the two temporal consistency learning paths to reinforce each other. Further, we introduce a Hierarchical Contrastive Alignment Loss (HCAL) to comprehensively align the video and text query. Extensive experiments on two benchmarks demonstrate that our method performs favorably against state-of-the-art approaches. We will release all the source code.

Paper Structure

This paper contains 31 sections, 16 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: (a) Different supervision signals. (b) Proposals are coarse-grained in the training stage but fine-grained in the test stage, leading to inconsistency. (c) Frame saliency detection and moment localization share an extensive temporal correspondence between their respective positive temporal segments.
  • Figure 2: (a) Overview of our collaborative temporal consistency learning framework, which consists of multi-modal interaction, frame-level temporal consistency learning (Frame-level TCL), segment-level temporal consistency learning (Segment-level TCL), and cross-consistency guidance. (b) Frame-level consistency guidance uses the frame-level saliency scores to enhance the fine-grained alignment of video and text in segment-level moment localization. (c) Segment-level consistency guidance generates a semantic-aware consistency mask with a proposal-based mask generator to guide frame-level TCL. (d) The hierarchical contrastive alignment loss (HCAL) consists of intra-video selective alignment and inter-video contrastive mining losses that regularize the alignment between the video and its paired sentence.
  • Figure 3: The effect of the coefficients of the intra-video selective alignment loss and the inter-video contrastive mining loss on model performance. We present R1@0.5 results on Charades-STA.
  • Figure 4: Qualitative examples of top-1 predictions on Charades-STA. We compare our method with ViGA and D3G. GT indicates the ground truth temporal boundary.
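
The R1@0.5 metric reported in Figure 3 is the fraction of queries whose top-1 predicted moment overlaps the ground-truth moment with a temporal IoU of at least 0.5. The sketch below illustrates that computation; the function names and the example timestamps are hypothetical and are not taken from the paper's code or data.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(
    top1_preds: list[tuple[float, float]],
    gts: list[tuple[float, float]],
    threshold: float = 0.5,
) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    if len(top1_preds) != len(gts):
        raise ValueError("need exactly one top-1 prediction per ground truth")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)


# Made-up example: the first prediction overlaps well, the second misses.
preds = [(2.0, 8.0), (0.0, 3.0)]
truths = [(3.0, 9.0), (5.0, 10.0)]
score = recall_at_1(preds, truths, threshold=0.5)
```

In this toy example the first pair has an IoU of 5/7 and the second has no overlap, so one of the two queries counts as a hit.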