OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding

Jiali Yao; Xinran Deng; Xin Gu; Mengrui Dai; Bing Fan; Zhipeng Zhang; Yan Huang; Heng Fan; Libo Zhang

TL;DR

OmniSTVG extends spatio-temporal video grounding from a single target to every target mentioned in a textual query, including interacting counterparts, in untrimmed videos. The authors introduce the BOSTVG benchmark (10,018 videos, 287 classes, 10.2M frames) and OmniTube, a Transformer-based baseline with a multimodal encoder and separate spatial and temporal decoders that predicts multiple spatial tubes and their temporal extents. Experiments and ablations show promising results for OmniTube and support its design choices. The work provides a benchmark, a task formulation, and a baseline for research on grounding multiple, interacting targets.

Abstract

In this paper, we propose spatio-temporal omni-object video grounding, dubbed OmniSTVG, a new STVG task that aims to localize, spatially and temporally, all targets mentioned in the textual query from videos. Compared to classic STVG, which locates only a single target, OmniSTVG enables localization of not only an arbitrary number of text-referred targets but also their interacting counterparts in the query from the video, making it more flexible and practical in real scenarios for comprehensive understanding. To facilitate exploration of OmniSTVG, we introduce BOSTVG, a large-scale benchmark dedicated to OmniSTVG. Specifically, BOSTVG consists of 10,018 videos with 10.2M frames and covers a wide selection of 287 classes from diverse scenarios. Each sequence in BOSTVG, paired with a free-form textual query, encompasses a varying number of targets ranging from 1 to 10. To ensure high quality, each video is manually annotated with meticulous inspection and refinement. To the best of our knowledge, BOSTVG is to date the first and the largest benchmark for OmniSTVG. To encourage future research, we introduce a simple yet effective approach, named OmniTube, which, drawing inspiration from Transformer-based STVG methods, is specially designed for OmniSTVG and demonstrates promising results. By releasing BOSTVG, we hope to go beyond classic STVG by locating every object appearing in the query for more comprehensive understanding, opening up a new direction for STVG. Our benchmark, model, and results will be released at https://github.com/JellyYao3000/OmniSTVG.
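To make the task output concrete, the sketch below models one OmniSTVG sample as a video, a free-form query, and one spatio-temporal tube per mentioned target. All names here (`Tube`, `GroundingSample`, `is_valid`) are hypothetical and chosen for illustration; this is not the BOSTVG annotation format or the OmniTube interface. The abstract does not say whether targets share one temporal extent, so the sketch simply assumes each tube stores its own.

```python
from dataclasses import dataclass, field

# Hypothetical types for illustration only; not the BOSTVG annotation format.
Box = tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2) in pixels


@dataclass
class Tube:
    """One grounded target: a temporal extent plus one box per frame inside it."""
    phrase: str        # part of the query this target corresponds to
    start_frame: int   # first frame of the temporal extent (inclusive)
    end_frame: int     # last frame of the temporal extent (inclusive)
    boxes: dict[int, Box] = field(default_factory=dict)  # frame index -> box

    def is_complete(self) -> bool:
        # Every frame in [start_frame, end_frame] should carry a box.
        frames = range(self.start_frame, self.end_frame + 1)
        return all(f in self.boxes for f in frames)


@dataclass
class GroundingSample:
    """A video, its free-form query, and every target the query mentions."""
    video_id: str
    query: str
    tubes: list[Tube]

    def is_valid(self) -> bool:
        # The abstract states that each sequence contains 1 to 10 targets.
        in_range = 1 <= len(self.tubes) <= 10
        return in_range and all(t.is_complete() for t in self.tubes)


# Made-up example: classic STVG would return one tube; here all three are kept.
sample = GroundingSample(
    video_id="demo",
    query="a man throws a ball to a dog",
    tubes=[
        Tube("a man", 0, 1, {0: (10, 20, 50, 120), 1: (12, 20, 52, 120)}),
        Tube("a ball", 0, 1, {0: (55, 60, 65, 70), 1: (70, 55, 80, 65)}),
        Tube("a dog", 1, 1, {1: (100, 80, 150, 120)}),
    ],
)
print(sample.is_valid())
```

Under these assumptions, a classic single-target STVG sample is just the special case where `tubes` has length one.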
