Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding
Akash Kumar, Zsolt Kira, Yogesh Singh Rawat
TL;DR
This work tackles Weakly Supervised Spatio-Temporal Video Grounding (WSTVG) by extending a vision-language grounding foundation model to video. It identifies key limitations of applying image-grounding models such as Grounding DINO to videos: temporally inconsistent predictions, weak handling of complex queries, and difficulty with dense scenes. The authors introduce CoSPaL, comprising Tubelet Phrase Grounding (TPG) for joint spatio-temporal tubelet grounding, Contextual Referral Grounding (CRG) for extracting and leveraging query context, and Self-Paced Scene Understanding (SPS) to progressively increase task difficulty during training. The approach combines frame-level detections with tubelet tracking, cross-attention-based grounding, temporal reconstruction losses, and curriculum learning, achieving state-of-the-art results on VidSTG and HCSTVG-v1/v2 under weak supervision. Overall, CoSPaL demonstrates robust, scalable video grounding with improved temporal consistency and query comprehension, narrowing the gap to fully supervised methods without requiring spatio-temporal labels.
Abstract
In this work, we focus on Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), a multimodal task that aims to localize specific subjects spatio-temporally based on textual queries without bounding box supervision. Motivated by recent advancements in multimodal foundation models for grounding tasks, we first explore the potential of state-of-the-art object detection models for WSTVG. Despite their robust zero-shot capabilities, our adaptation reveals significant limitations, including inconsistent temporal predictions, inadequate understanding of complex queries, and challenges in adapting to difficult scenarios. We propose CoSPaL (Contextual Self-Paced Learning), a novel approach designed to overcome these limitations. CoSPaL integrates three core components: (1) Tubelet Phrase Grounding (TPG), which introduces spatio-temporal prediction by linking textual queries to tubelets; (2) Contextual Referral Grounding (CRG), which improves comprehension of complex queries by extracting contextual information to refine object identification over time; and (3) Self-Paced Scene Understanding (SPS), a training paradigm that progressively increases task difficulty, enabling the model to adapt to complex scenarios by transitioning from coarse to fine-grained understanding.
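
To make the idea of progressively increasing task difficulty concrete, the following is a minimal sketch of a fixed-schedule curriculum, not the paper's SPS implementation. It assumes, purely for illustration, that scene difficulty can be approximated by the number of candidate tubelets in a video; the names `Sample`, `num_tubelets`, `max_tubelets_for_epoch`, and `select_samples`, and the cap values `start=2` and `end=10`, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    video_id: str
    num_tubelets: int  # hypothetical difficulty proxy: more tubelets = denser scene


def max_tubelets_for_epoch(epoch, total_epochs, start=2, end=10):
    """Linearly raise the difficulty cap from `start` to `end` over training."""
    if total_epochs <= 1:
        return end
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return round(start + frac * (end - start))


def select_samples(samples, epoch, total_epochs, start=2, end=10):
    """Keep only samples whose difficulty proxy is within the current cap."""
    cap = max_tubelets_for_epoch(epoch, total_epochs, start, end)
    return [s for s in samples if s.num_tubelets <= cap]


if __name__ == "__main__":
    data = [Sample("a", 1), Sample("b", 4), Sample("c", 8), Sample("d", 12)]
    for epoch in range(5):
        chosen = select_samples(data, epoch, total_epochs=5)
        print(epoch, [s.video_id for s in chosen])
```

Under these assumptions, early epochs see only sparse scenes and later epochs admit denser ones. A schedule driven by the model's own loss or confidence, rather than a fixed linear cap, would be closer to self-paced learning in the usual sense.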
