STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding

Aaryan Garg, Akash Kumar, Yogesh S Rawat

TL;DR

This work tackles Weakly Supervised Spatio-Temporal Video Grounding (WSTVG) by adapting vision-language foundation models through a baseline called Tubelet Referral Grounding (TRG). TRG integrates a Temporal Referral Module (TRM) and a Spatial Referral Module (SRM), using reconstruction-based temporal losses and contrastive spatial learning. On top of TRG, the paper introduces STPro, a progressive learning framework with two curricula: Sub-Action Temporal Curriculum Learning (SA-TCL), which raises temporal difficulty through sub-action decomposition, and Congestion-Guided Spatial Curriculum Learning (CG-SCL), which raises spatial difficulty through tubelet congestion-based sampling. STPro reports state-of-the-art results among weakly supervised methods on the VidSTG and HCSTVG benchmarks, including improvements of 1.0% on VidSTG-Declarative and 3.0% on HCSTVG-v1. Because it requires no bounding-box annotations, the approach is relevant to applications such as surveillance, robotics, and multimedia understanding.

Abstract

In this work we study Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), a challenging task of localizing subjects spatio-temporally in videos using only textual queries and no bounding box supervision. Inspired by recent advances in vision-language foundation models, we investigate their utility for WSTVG, leveraging their zero-shot grounding capabilities. However, we find that a simple adaptation lacks essential spatio-temporal grounding abilities. To bridge this gap, we introduce Tubelet Referral Grounding (TRG), which connects textual queries to tubelets to enable spatio-temporal predictions. Despite its promise, TRG struggles with compositional action understanding and dense scene scenarios. To address these limitations, we propose STPro, a novel progressive learning framework with two key modules: (1) Sub-Action Temporal Curriculum Learning (SA-TCL), which incrementally builds compositional action understanding, and (2) Congestion-Guided Spatial Curriculum Learning (CG-SCL), which adapts the model to complex scenes by spatially increasing task difficulty. STPro achieves state-of-the-art results on three benchmark datasets, with improvements of 1.0% on VidSTG-Declarative and 3.0% on HCSTVG-v1.
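The idea of a curriculum that increases spatial difficulty can be illustrated with a small sketch. It is not the authors' CG-SCL implementation: it assumes, purely for illustration, that "congestion" can be approximated by the number of candidate tubelets in a video, and that training proceeds in cumulative stages from least to most congested. The function name `congestion_stages` and its parameters are hypothetical.

```python
import math
from typing import Dict, List


def congestion_stages(tubelets_per_video: Dict[str, int],
                      num_stages: int) -> List[List[str]]:
    """Return cumulative training stages ordered from easy to hard.

    Assumption (not from the paper): a video's difficulty is its
    candidate-tubelet count, so fewer tubelets means an easier sample.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    # Sort by tubelet count; break ties by video id for determinism.
    ordered = sorted(tubelets_per_video,
                     key=lambda vid: (tubelets_per_video[vid], vid))
    stages = []
    for stage in range(1, num_stages + 1):
        # Each stage keeps all earlier samples and adds harder ones.
        cutoff = math.ceil(len(ordered) * stage / num_stages)
        stages.append(ordered[:cutoff])
    return stages


if __name__ == "__main__":
    counts = {"a": 2, "b": 7, "c": 4, "d": 12}
    for i, videos in enumerate(congestion_stages(counts, 2), start=1):
        print(f"stage {i}: {videos}")
```

The cumulative design keeps easy samples in later stages so the model does not forget them; a real schedule could equally well use disjoint stages or a sampling probability that varies with congestion.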

Paper Structure

This paper contains 33 sections, 2 equations, 22 figures, 16 tables.

Figures (22)

  • Figure 1: Overview of TRG: TRG contains two grounding modules, namely TRM and SRM. TRM predicts the temporal action boundary via cross-attention between vision and query features. SRM grounds the correct referral subject tubelet from a set of TRM-selected candidates satisfying $tIoU(\mathcal{T}_{o_{k}}, Pred) > T_{filt}$. More details can be found in the supplementary.
  • Figure 1: Distribution of frequently co-occurring soft-labels in tubelets for HCSTVG-1. Grounding-DINO soft labels can be more than one class for a given detection (e.g., person man).
  • Figure 2: Temporal Limitations of TRG: When given a trimmed query (see trim in the figure), TRM misaligns temporal predictions, shifting right instead of left for missing final actions. This indicates it grasps temporal boundaries but lacks compositional understanding of individual actions, limiting generalization to novel sequences.
  • Figure 2: Distribution of the percentage of non-dominant soft-label detections in tubelets for all tubelets with conflicting combinations (e.g., man + woman) in HCSTVG-1.
  • Figure 3: Temporal-Shift Distribution of TRM: Across HCSTVG-1, TRM struggles with temporal compositionality, failing to shift predictions correctly when queries are trimmed. Instead, it predominantly exhibits left/no-shift for first-action trimming and right/no-shift for last-action trimming, revealing its limitations.
  • ...and 17 more figures
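
The candidate filter in the first Figure 1 caption, which keeps tubelets whose temporal IoU with the TRM prediction exceeds $T_{filt}$, can be sketched as follows. This is a minimal illustration under stated assumptions: each tubelet and the prediction are represented as a `(start, end)` span, and the names `temporal_iou` and `filter_candidates` are hypothetical, not taken from the paper.

```python
from typing import List, Sequence, Tuple

Span = Tuple[float, float]  # (start, end), in frames or seconds


def temporal_iou(a: Span, b: Span) -> float:
    """Temporal intersection-over-union of two spans."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def filter_candidates(tubelet_spans: Sequence[Span],
                      pred: Span,
                      t_filt: float) -> List[int]:
    """Indices k of tubelets with tIoU(tubelet_k, pred) > t_filt."""
    return [k for k, span in enumerate(tubelet_spans)
            if temporal_iou(span, pred) > t_filt]


if __name__ == "__main__":
    spans = [(0, 10), (5, 15), (20, 30)]
    print(filter_candidates(spans, pred=(4, 12), t_filt=0.5))
```

The comparison is strict, matching the `>` in the caption, so a tubelet whose tIoU equals the threshold exactly is excluded.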