Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding
Chaolei Tan, Jianhuang Lai, Wei-Shi Zheng, Jian-Fang Hu
TL;DR
This work tackles Weakly-Supervised Video Paragraph Grounding (WSVPG), eliminating the need for timestamp labels by learning both cross-modal alignment and temporal boundary regression in a one-stage, proposal-free framework. It introduces SiamGTR, a siamese Grounding Transformer with an Augmentation Branch that regresses pseudo paragraph boundaries in a constructed pseudo-video and an Inference Branch that learns order-guided alignment on real videos, sharing parameters to transfer supervision. A pseudo data generation scheme with random boundary shifting, plus a set of losses including self-consistent boundary regression and order-guided attention, yields strong performance on ActivityNet-Captions, Charades-CD-OOD, and TACoS under weakly- and semi-supervised settings, often surpassing fully-supervised baselines. The approach reduces labeling costs while delivering competitive or superior localization accuracy, demonstrating practical potential for scalable video-language grounding and enabling easy extension to semi-supervised learning regimes.
Abstract
Video Paragraph Grounding (VPG) is an emerging task in video-language understanding that aims to localize multiple sentences with semantic relations and temporal order in an untrimmed video. However, existing VPG approaches rely heavily on a large number of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations. Unlike previous weakly-supervised grounding frameworks, which use multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns cross-modal feature alignment and temporal coordinate regression without timestamp labels, achieving concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches that learn complementary supervision. An Augmentation Branch directly regresses the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch captures the order-guided feature correspondence for localizing multiple sentences in a normal video. Extensive experiments demonstrate that our paradigm is practical and flexible, supporting efficient weakly-supervised or semi-supervised learning and outperforming state-of-the-art methods trained with the same or stronger supervision.
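To make the idea of a pseudo video with known paragraph boundaries concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a pseudo video is built by splicing the clips of one video (the foreground, described by the paragraph) into the clips of an unrelated video (the background), and that random boundary shifting means trimming a small random number of clips from each end of the foreground. The function name `build_pseudo_video`, the parameter `max_shift`, and the use of plain Python lists as stand-ins for clip features are all hypothetical choices made for illustration.

```python
import random


def build_pseudo_video(foreground, background, max_shift=0.1, rng=None):
    """Splice `foreground` clips into `background` clips (illustrative only).

    Returns (pseudo_video, (start, end)), where start and end are normalized
    to [0, 1] and mark where the foreground sits inside the pseudo video.
    These boundaries come for free from the construction, so they can act as
    a regression target without any human timestamp labels.
    """
    if not foreground:
        raise ValueError("foreground must contain at least one clip")
    if not 0 <= max_shift < 0.5:
        raise ValueError("max_shift must be in [0, 0.5)")
    rng = rng or random.Random()

    n = len(foreground)
    # Assumed form of random boundary shifting: trim up to `max_shift`
    # of the foreground clips from each end.
    max_trim = int(max_shift * n)
    trim_left = rng.randint(0, max_trim)
    trim_right = rng.randint(0, max_trim)
    kept = foreground[trim_left:n - trim_right]

    # Insert the trimmed foreground at a random position in the background.
    insert_at = rng.randint(0, len(background))
    pseudo = background[:insert_at] + kept + background[insert_at:]

    total = len(pseudo)
    start = insert_at / total
    end = (insert_at + len(kept)) / total
    return pseudo, (start, end)


if __name__ == "__main__":
    fg = [f"fg{i}" for i in range(20)]  # stand-ins for clip features
    bg = [f"bg{i}" for i in range(30)]
    pseudo, (start, end) = build_pseudo_video(fg, bg, rng=random.Random(0))
    print(len(pseudo), round(start, 3), round(end, 3))
```

In this sketch the returned `(start, end)` pair plays the role of the pseudo paragraph boundary that a regression branch could be trained on, while the real video, for which no such boundary exists, would be handled by the alignment branch with shared weights.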
