Context-Guided Spatio-Temporal Video Grounding

Xin Gu; Heng Fan; Yan Huang; Tiejian Luo; Libo Zhang

TL;DR

This paper tackles spatio-temporal video grounding (STVG), where the text query alone often carries too little object information to localize the target in complex videos. It introduces CG-STVG, which mines discriminative instance context from the video via an Instance Context Generation (ICG) module and refines it through an Instance Context Refinement (ICR) module, integrating these cues into a DETR-style, multi-stage transformer decoder to progressively improve target localization. The main contributions are the ICG and ICR modules, a two-level temporal-spatial refinement mechanism, and a comprehensive evaluation showing state-of-the-art results on HCSTVG-v1/v2 and VidSTG, with gains in m_tIoU and m_vIoU. The approach demonstrates that combining object-level visual context with text guidance improves robustness against distractors and appearance variations, offering a practical improvement for multimodal video understanding.

Abstract

The spatio-temporal video grounding (STVG) task aims at locating a spatio-temporal tube for a specific instance given a text query. Despite advancements, current methods easily suffer from distractors or heavy object appearance variations in videos because the text provides insufficient object information, leading to degraded performance. To address this, we propose a novel framework, context-guided STVG (CG-STVG), which mines discriminative instance context for objects in videos and applies it as supplementary guidance for target localization. The key to CG-STVG lies in two specially designed modules: instance context generation (ICG), which focuses on discovering visual context information (in both appearance and motion) of the instance, and instance context refinement (ICR), which aims to improve the instance context from ICG by eliminating irrelevant or even harmful information from it. During grounding, ICG and ICR are deployed together at each decoding stage of a Transformer architecture for instance context learning. In particular, the instance context learned at one decoding stage is fed to the next stage and used as guidance containing rich and discriminative object features to enhance target-awareness in the decoding features, which in turn helps generate better instance context and ultimately improves localization. Compared to existing methods, CG-STVG benefits from both the object information in the text query and the guidance from mined instance visual context for more accurate target localization. In our experiments on three benchmarks, HCSTVG-v1/-v2 and VidSTG, CG-STVG sets a new state of the art in m_tIoU and m_vIoU on all of them, showing its efficacy. The code will be released at https://github.com/HengLan/CGSTVG.
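
The abstract reports results in m_tIoU and m_vIoU without defining them. The sketch below shows one way these metrics could be computed, assuming the definitions commonly used in the STVG literature: tIoU is the ratio of intersecting to union frames between the predicted and ground-truth tubes, and vIoU sums the per-frame box IoU over the intersecting frames and divides by the number of union frames. The function names (`box_iou`, `tube_ious`, `mean_metrics`) and the frame-index-to-box dictionary layout are illustrative assumptions, not the authors' evaluation code, which may differ in details.

```python
from typing import Dict, Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout
Tube = Dict[int, Box]                    # frame index -> box, assumed layout


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_ious(pred: Tube, gt: Tube) -> Tuple[float, float]:
    """Return (tIoU, vIoU) for one sample under the assumed definitions."""
    inter_frames = pred.keys() & gt.keys()
    union_frames = pred.keys() | gt.keys()
    if not union_frames:
        return 0.0, 0.0
    t_iou = len(inter_frames) / len(union_frames)
    v_iou = sum(box_iou(pred[f], gt[f]) for f in inter_frames) / len(union_frames)
    return t_iou, v_iou


def mean_metrics(samples: Iterable[Tuple[Tube, Tube]]) -> Tuple[float, float]:
    """Average tIoU and vIoU over (prediction, ground truth) pairs."""
    scores = [tube_ious(pred, gt) for pred, gt in samples]
    if not scores:
        return 0.0, 0.0
    m_tiou = sum(s[0] for s in scores) / len(scores)
    m_viou = sum(s[1] for s in scores) / len(scores)
    return m_tiou, m_viou
```

Because vIoU is normalized by the union of frames, a prediction with accurate boxes but a poorly aligned temporal extent still scores low, which is why the two metrics are usually reported together.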

Paper Structure (21 sections, 14 equations, 11 figures, 8 tables)

Introduction
Related Work
The Proposed Method
Multimodal Encoder
Context-Guided Decoder for Grounding
Instance Context Generation (ICG)
Instance Context Refinement (ICR)
Optimization
Experiments
Datasets and Metrics
State-of-the-art Comparison
Ablation Study
Qualitative Analysis
Conclusion
Detailed Architectures of Modules
...and 6 more sections

Figures (11)

Figure 1: Comparison between (a) existing methods that localize the target using object information from text query and (b) our CG-STVG that enjoys object information from text query and guidance from mined instance context for STVG. Best viewed in color for all figures.
Figure 2: Overview of our method, which consists of a multimodal encoder for feature extraction and a context-guided decoder by cascading a set of decoding stages for grounding. In each decoding stage, instance context is mined to guide query learning for better localization.
Figure 3: Attention maps for spatial queries in video frames in the spatial-decoding block without (image (a)) and with our proposed instance context (image (b)). We can clearly see that our instance context effectively improves target-awareness in the spatial queries and thus the target position information learning for localization. The red boxes indicate the foreground object to localize.
Figure 4: Illustration of ICG (image (a)) and ICR (image (b)).
Figure 5: Illustration of ICR for context refinement. The red boxes indicate the foreground, while the yellow boxes indicate the instance context. We can see that our ICR helps eliminate irrelevant features in the initial instance context generated by ICG.
...and 6 more figures
