Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding

Chaolei Tan, Jianhuang Lai, Wei-Shi Zheng, Jian-Fang Hu

TL;DR

This work tackles Weakly-Supervised Video Paragraph Grounding (WSVPG), eliminating the need for timestamp labels by learning both cross-modal alignment and temporal boundary regression in a one-stage, proposal-free framework. It introduces SiamGTR, a siamese Grounding Transformer with an Augmentation Branch that regresses pseudo paragraph boundaries in a constructed pseudo-video and an Inference Branch that learns order-guided alignment on real videos, sharing parameters to transfer supervision. A pseudo data generation scheme with random boundary shifting, plus a set of losses including self-consistent boundary regression and order-guided attention, yields strong performance on ActivityNet-Captions, Charades-CD-OOD, and TACoS under weakly- and semi-supervised settings, often surpassing fully-supervised baselines. The approach reduces labeling costs while delivering competitive or superior localization accuracy, demonstrating practical potential for scalable video-language grounding and enabling easy extension to semi-supervised learning regimes.

Abstract

Video Paragraph Grounding (VPG) is an emerging task in video-language understanding, which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However, existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations. Different from previous weakly-supervised grounding frameworks based on multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns the cross-modal feature alignment and temporal coordinate regression without timestamp labels to achieve concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches for learning complementary supervision. An Augmentation Branch is utilized for directly regressing the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch is designed to capture the order-guided feature correspondence for localizing multiple sentences in a normal video. We demonstrate by extensive experiments that our paradigm has superior practicability and flexibility to achieve efficient weakly-supervised or semi-supervised learning, outperforming state-of-the-art methods trained with the same or stronger supervision.

Paper Structure

This paper contains 18 sections, 9 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: (a) Chronological cross-modal alignment in a video and its paired sentences. (b) Pseudo boundary supervision for regressing paragraph timestamps in a composed pseudo video.
  • Figure 2: Our siamese framework for joint alignment and regression.
  • Figure 3: Illustration of the proposed Siamese Grounding TRansformer (SiamGTR) architecture. The Augmentation Branch (abbreviated as A.B.) takes pseudo video features obtained by randomly inserting the query-related video features into irrelevant video features. It learns to temporally regress the interval of interest from the pseudo video with the paragraph as query. The Inference Branch (abbreviated as I.B.) receives normal video features for learning the cross-modal feature alignment among multiple sentences in the video.
  • Figure 4: Visualization of prediction results from different models.
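
The pseudo-video construction described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `make_pseudo_video`, the `(time, dim)` feature layout, the uniformly sampled insertion point, and the normalization of the boundary to [0, 1] are all assumptions made for illustration. The random boundary shifting mentioned in the TL;DR is not modeled here.

```python
import numpy as np


def make_pseudo_video(query_feats, background_feats, rng):
    """Insert query-related clip features into unrelated video features.

    Hypothetical helper (not from the paper's code). Both inputs are
    arrays of shape (time, dim). Returns the pseudo-video features and
    the normalized (start, end) boundary of the inserted segment.
    """
    t_q = query_feats.shape[0]
    t_b = background_feats.shape[0]
    # Assumption: the insertion index is uniform over [0, t_b], so the
    # query-related segment may also land at either end of the video.
    pos = int(rng.integers(0, t_b + 1))
    pseudo = np.concatenate(
        [background_feats[:pos], query_feats, background_feats[pos:]], axis=0
    )
    total = t_q + t_b
    # The insertion point is known, so the boundary comes for free
    # and can serve as a regression target without human timestamps.
    boundary = (pos / total, (pos + t_q) / total)
    return pseudo, boundary


rng = np.random.default_rng(0)
query = np.ones((4, 8), dtype=np.float32)        # stand-in for paired video features
background = np.zeros((12, 8), dtype=np.float32)  # stand-in for irrelevant video features
pseudo, (start, end) = make_pseudo_video(query, background, rng)
```

In this sketch the regression target is a by-product of the construction itself, which is the reason a pseudo video can supply boundary supervision that a normal video lacks under the weakly-supervised setting.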