Contrast-Unity for Partially-Supervised Temporal Sentence Grounding
Haicheng Wang, Chen Ju, Weixiong Lin, Chaofan Ma, Shuai Xiao, Ya Zhang, Yanfeng Wang
TL;DR
The paper addresses temporal sentence grounding under a partially-supervised regime where only short clips within the event are labeled. It proposes Contrast-Unity, a two-stage implicit-explicit framework that first refines grounding through quadruple-contrastive representation learning to produce pseudo-labels and then trains a fully-supervised model with those labels. Key contributions include the partial-supervision formulation, the quadruple-contrastive loss design with intra- and inter-sample components, and an event-detector-based pseudo-label generator, validated by state-of-the-art results on Charades-STA and ActivityNet Captions under single-frame and short-clip supervision. The approach offers a flexible path between weak and full supervision, reducing annotation costs while maintaining strong grounding performance and generalizing across fully-supervised backbones.
Abstract
Temporal sentence grounding aims to detect the timestamps of an event described by a natural language query in an untrimmed video. The existing fully-supervised setting achieves strong results but requires expensive annotation, while the weakly-supervised setting uses cheap labels but performs poorly. To pursue high performance at lower annotation cost, this paper introduces an intermediate partially-supervised setting, in which only a short clip within each event is labeled during training. To make full use of these partial labels, we design a contrast-unity framework with the two-stage goal of implicit-explicit progressive grounding. In the implicit stage, we align event-query representations at fine granularity using comprehensive quadruple contrastive learning: event-query gathering, event-background separation, intra-cluster compactness, and inter-cluster separability. The resulting high-quality representations yield acceptable grounding pseudo-labels. In the explicit stage, to optimize the grounding objective directly, we train a fully-supervised model on the obtained pseudo-labels for grounding refinement and denoising. Extensive experiments and thorough ablations on Charades-STA and ActivityNet Captions demonstrate the significance of partial supervision, as well as the superior performance of our method.
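
The event-query gathering and event-background separation terms can be illustrated with a generic contrastive objective. The abstract does not state the exact form of the loss, so the following is only a minimal sketch that assumes an InfoNCE-style formulation: the query embedding is the anchor, features of the labeled short clip are positives, and background clip features are negatives. The function names, the temperature value, and the synthetic features are all hypothetical and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form, not the paper's exact loss).

    anchor:    (D,)   e.g. a query embedding
    positives: (P, D) e.g. features of the labeled short clip
    negatives: (N, D) e.g. features of background clips
    """
    a = l2_normalize(anchor)
    pos = l2_normalize(positives) @ a / temperature  # (P,)
    neg = l2_normalize(negatives) @ a / temperature  # (N,)
    # Each row: one positive logit followed by all negative logits.
    logits = np.concatenate([pos[:, None], np.tile(neg, (len(pos), 1))], axis=1)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())

# Synthetic features (hypothetical): event clips lie near the query, background does not.
rng = np.random.default_rng(0)
query = rng.normal(size=16)
event_clips = query + 0.1 * rng.normal(size=(3, 16))
background_clips = rng.normal(size=(5, 16))

aligned = info_nce(query, event_clips, background_clips)
swapped = info_nce(query, background_clips, event_clips)
print(aligned < swapped)
```

The loss is lower when the event clips, rather than the background clips, sit close to the query, which is the behavior the first two contrastive terms aim for. The intra-cluster and inter-cluster terms might reuse the same function with different choices of anchor, positives, and negatives, but that pairing is an assumption here.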
