Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding

Akash Kumar, Zsolt Kira, Yogesh Singh Rawat

TL;DR

This work tackles Weakly Supervised Spatio-Temporal Video Grounding (WSTVG) by extending a vision-language grounding foundation model to video. It identifies key limitations of applying image-grounding models such as Grounding DINO to videos: temporally inconsistent predictions, weak handling of complex queries, and difficulty in dense scenes. The authors introduce CoSPaL, which comprises Tubelet Phrase Grounding (TPG) for joint spatio-temporal tubelet grounding, Contextual Referral Grounding (CRG) for extracting and leveraging query context, and Self-Paced Scene Understanding (SPS) for progressively increasing task difficulty during training. The approach combines frame-level detections with tubelet tracking, cross-attention-based grounding, temporal reconstruction losses, and curriculum learning, and achieves state-of-the-art results on VidSTG and HCSTVG-v1/v2 under weak supervision. Overall, CoSPaL improves temporal consistency and query comprehension, narrowing the gap to fully supervised methods without requiring spatio-temporal labels.
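
As a rough illustration of how frame-level detections might be turned into tubelets, the sketch below links per-frame boxes greedily by IoU. This is not the tracker used in the paper, which the summary does not specify; the names `iou` and `link_tubelets` and the `iou_thresh` parameter are hypothetical and introduced only for this example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Tubelet = List[Tuple[int, Box]]          # list of (frame index, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tubelets(frames: List[List[Box]], iou_thresh: float = 0.5) -> List[Tubelet]:
    """Greedy linking (an assumed stand-in for a real tracker).

    Each tubelet that ended on the previous frame takes the unmatched box
    in the current frame with the highest IoU, if that IoU reaches
    iou_thresh. Boxes left unmatched start new tubelets.
    """
    tubelets: List[Tubelet] = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for tube in tubelets:
            last_t, last_box = tube[-1]
            if last_t != t - 1 or not unmatched:
                continue  # tubelet already ended, or nothing left to match
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= iou_thresh:
                tube.append((t, best))
                unmatched.remove(best)
        tubelets.extend([[(t, b)] for b in unmatched])
    return tubelets


frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 0, 11, 10), (50, 51, 60, 61)],
    [(2, 0, 12, 10)],
]
tubes = link_tubelets(frames)
print([len(tube) for tube in tubes])
```

In a pipeline of this shape, each resulting tubelet would be one candidate subject to score against the text query. How the candidates are actually scored is described only at a high level in this summary.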

Abstract

In this work, we focus on Weakly Supervised Spatio-Temporal Video Grounding (WSTVG). It is a multimodal task aimed at localizing specific subjects spatio-temporally based on textual queries without bounding box supervision. Motivated by recent advancements in multi-modal foundation models for grounding tasks, we first explore the potential of state-of-the-art object detection models for WSTVG. Despite their robust zero-shot capabilities, our adaptation reveals significant limitations, including inconsistent temporal predictions, inadequate understanding of complex queries, and challenges in adapting to difficult scenarios. We propose CoSPaL (Contextual Self-Paced Learning), a novel approach which is designed to overcome these limitations. CoSPaL integrates three core components: (1) Tubelet Phrase Grounding (TPG), which introduces spatio-temporal prediction by linking textual queries to tubelets; (2) Contextual Referral Grounding (CRG), which improves comprehension of complex queries by extracting contextual information to refine object identification over time; and (3) Self-Paced Scene Understanding (SPS), a training paradigm that progressively increases task difficulty, enabling the model to adapt to complex scenarios by transitioning from coarse to fine-grained understanding.
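
The self-paced idea in (3) can be sketched as a curriculum that admits harder samples stage by stage. The example below is a minimal sketch under the assumption that difficulty is measured by the number of subjects in a video. The `num_subjects` field, the `curriculum_stages` function, and the linear threshold schedule are all hypothetical; the abstract does not specify the paper's actual schedule.

```python
from typing import Dict, List


def curriculum_stages(samples: List[Dict], num_stages: int = 3) -> List[List[Dict]]:
    """Return one training pool per stage, growing from easy to hard.

    Difficulty is assumed to be the subject count of each sample. Stage i
    keeps every sample whose count is at most a threshold that rises
    linearly from the minimum to the maximum count, so the final stage
    contains the full training set.
    """
    if not samples or num_stages < 1:
        return []
    k_min = min(s["num_subjects"] for s in samples)
    k_max = max(s["num_subjects"] for s in samples)
    stages = []
    for i in range(num_stages):
        if i == num_stages - 1:
            threshold = k_max  # last stage always admits everything
        else:
            threshold = k_min + (k_max - k_min) * (i + 1) / num_stages
        stages.append([s for s in samples if s["num_subjects"] <= threshold])
    return stages


samples = [
    {"video": "a", "num_subjects": 1},
    {"video": "b", "num_subjects": 2},
    {"video": "c", "num_subjects": 3},
    {"video": "d", "num_subjects": 4},
    {"video": "e", "num_subjects": 6},
]
stages = curriculum_stages(samples, num_stages=3)
print([len(pool) for pool in stages])
```

A training loop would then run a few epochs on each pool in order, so the model sees sparse scenes before crowded ones.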

Paper Structure

This paper contains 42 sections, 5 equations, 13 figures, 13 tables.

Figures (13)

  • Figure 1: Comparison across tasks. (Left) (a) Phrase grounding (PG) grounds all nouns in the sentence; (b) referral grounding (RG) makes the task harder by grounding a specific subject; (c) video object grounding (VOG) has a fixed number of object categories and a fixed query template; (d) temporal video grounding (TVG) focuses only on temporal localization. In contrast to these, (e) STVG requires spatio-temporal grounding of a specific subject using a free-form query. Green denotes ground truth. A darker shade denotes the temporal boundary. (Right) The table summarizes the challenges involved in STVG compared with the other tasks.
  • Figure 2: Illustration of failures of W-GDINO: (a) Unreliable Temporal Predictions: Foundation model predictions are inconsistent across time and switch attention between actors, which degrades performance. (b) Imbalanced Query Attention: The model lacks understanding of complex queries. Across time, the query that the model attends to for each subject tubelet is inconsistent and does not match the ground truth. (c) Complex Scene Understanding: As the number of subjects increases, the model's ability to focus on the specific subject described in the query decreases, showing its lack of understanding of challenging scenarios. K denotes the total number of subjects. Blue and red denote predictions; green denotes ground truth in (a) and (c), and brown in (b).
  • Figure 3: Overview of CoSPaL: TPG contains two grounding modules, spatial and temporal. The spatial module grounds the correct subject tubelet. The temporal module predicts the temporal action boundary via cross-attention between highlighted vision features and masked query features. The Contextual Referral Grounding (CRG) block shows the breakdown and generation of the local ($Q_{ol}$) and global ($Q_{og}$) queries. Green shows the predicted bounding box. A darker green shade shows the predicted temporal boundary.
  • Figure 4: Qualitative analysis: Green: ground truth; red: W-GDINO; blue: CoSPaL (a darker shade represents temporal detection boundaries). W-GDINO suffers from poor temporal localization and imbalanced attention, focusing on different subjects throughout the video. CoSPaL overcomes these limitations and has better overlap with the ground truth in both scenarios.
  • Figure 5: Comparison on computational efficiency against fully supervised approaches.
  • ...and 8 more figures