Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

Syed Talal Wasim; Muzammal Naseer; Salman Khan; Ming-Hsuan Yang; Fahad Shahbaz Khan

TL;DR

This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task and proposes a novel spatio-temporal video grounding model, surpassing state-of-the-art results in closed-set evaluations on multiple datasets and demonstrating superior performance in open-vocabulary scenarios.

Abstract

Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query. This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task. Unlike prevalent closed-set approaches that struggle with open-vocabulary scenarios due to limited training data and predefined vocabularies, our model leverages pre-trained representations from foundational spatial grounding models. This empowers it to effectively bridge the semantic gap between natural language and diverse visual content, achieving strong performance in closed-set and open-vocabulary settings. Our contributions include a novel spatio-temporal video grounding model, surpassing state-of-the-art results in closed-set evaluations on multiple datasets and demonstrating superior performance in open-vocabulary scenarios. Notably, the proposed model outperforms state-of-the-art methods in closed-set settings on VidSTG (Declarative and Interrogative) and HC-STVG (V1 and V2) datasets. Furthermore, in open-vocabulary evaluations on HC-STVG V1 and YouCook-Interactions, our model surpasses the recent best-performing models by $4.88$ m_vIoU and $1.83\%$ accuracy, demonstrating its efficacy in handling diverse linguistic and visual concepts for improved video understanding. Our code will be publicly released.
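
The abstract reports gains in m_vIoU without defining the metric. The sketch below illustrates how it might be computed, assuming the definition commonly used in spatio-temporal video grounding benchmarks: per-sample vIoU sums the box IoU over frames present in both the predicted and ground-truth tubes, divides by the number of frames in the union of the two tubes, and m_vIoU averages this over samples. The function names (`box_iou`, `v_iou`, `mean_v_iou`) and the frame-indexed dictionary layout are hypothetical choices for illustration and are not taken from the authors' code.

```python
from typing import Dict, List, Tuple

# A box is (x1, y1, x2, y2) in pixel coordinates (assumed layout).
Box = Tuple[float, float, float, float]
# A tube maps a frame index to the box on that frame (assumed layout).
Tube = Dict[int, Box]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def v_iou(pred: Tube, gt: Tube) -> float:
    """Sum box IoU over shared frames, normalized by the temporal union."""
    frames_union = set(pred) | set(gt)
    frames_inter = set(pred) & set(gt)
    if not frames_union:
        return 0.0
    total = sum(box_iou(pred[t], gt[t]) for t in frames_inter)
    return total / len(frames_union)


def mean_v_iou(samples: List[Tuple[Tube, Tube]]) -> float:
    """Average vIoU over (prediction, ground truth) pairs."""
    if not samples:
        return 0.0
    return sum(v_iou(p, g) for p, g in samples) / len(samples)
```

Because the denominator is the temporal union, a prediction with perfect boxes but a poorly localized time span is still penalized, which is why the metric reflects spatial and temporal grounding jointly.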

Paper Structure (21 sections, 7 equations, 3 figures, 5 tables)

This paper contains 21 sections, 7 equations, 3 figures, 5 tables.

Introduction
Related Work
Methodology
Problem Definition
Spatio-Temporal Video Grounding
Cross-Modality Spatio-Temporal Encoder
Language-Guided Query Selection
Cross-Modality Spatio-Temporal Decoder
Prediction Heads
Loss Function
Results
Experimental Setup and Protocols
Implementation Details
Evaluation Settings
Datasets
...and 6 more sections

Figures (3)

Figure 1: Performance comparison on conventional closed-set and open-vocabulary settings for the video grounding task. We compare our approach with TubeDETR [yang2022tubedetr] and STCAT [jin2022stcat] in the supervised setting on VidSTG [zhang2020stgrnVIDSTG] declarative/interrogative and HC-STVG V1 [tang2021stgvtHCSTVG], along with open-vocabulary evaluation on the HC-STVG V1 and YouCook-Interactions [tan2021youcook-inter] datasets.
Figure 2: Overall architecture of our video grounding model. It consists of vision and text encoders that produce visual and textual features; a cross-modality spatio-temporal encoder that fuses information across spatial/temporal dimensions and visual/textual modalities; a language-guided query selection module that initializes cross-modal queries; a cross-modality spatio-temporal decoder that decodes the queries while fusing information from visual/textual features; and two prediction heads that predict the per-frame bounding boxes and the temporal tube. Icons in the figure mark which modules are trainable and which are frozen.
Figure 3: Sample visualization of video grounding results on HC-STVG V1 [tang2021stgvtHCSTVG] for TubeDETR [yang2022tubedetr], STCAT [jin2022stcat], and ours with the prompt "The man behind the shirtless man turns and squats." We show bounding boxes for Ground-truth, Closed-Set Supervised, and Open-Vocabulary results. Note that although both TubeDETR and STCAT are close to the ground truth in the supervised setting (STCAT more so than TubeDETR), they cannot correctly ground the text in the open-vocabulary setting.
