OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding

Jiali Yao, Xinran Deng, Xin Gu, Mengrui Dai, Bing Fan, Zhipeng Zhang, Yan Huang, Heng Fan, Libo Zhang

TL;DR

OmniSTVG advances video grounding by aiming to localize all targets mentioned in a textual query within untrimmed videos, including interacting counterparts. The authors introduce the BOSTVG dataset (10,018 videos, 287 classes, 10.2M frames) and the OmniTube baseline, a Transformer-based architecture with a multimodal encoder and separate spatial/temporal decoders to predict multiple spatial tubes and their temporal extents. Through extensive experiments and ablations, OmniTube demonstrates strong performance and validates the design choices, highlighting the practicality of grounding all queried objects for richer video understanding. This work provides a new benchmark, methodology, and baseline to drive research in comprehensive spatio-temporal grounding with multiple targets and interactions.

Abstract

In this paper, we propose spatio-temporal omni-object video grounding, dubbed OmniSTVG, a new STVG task that aims to localize, both spatially and temporally, all targets mentioned in the textual query from videos. Compared to classic STVG, which locates only a single target, OmniSTVG enables localization of not only an arbitrary number of text-referred targets but also their interacting counterparts in the query, making it more flexible and practical in real scenarios for comprehensive understanding. To facilitate exploration of OmniSTVG, we introduce BOSTVG, a large-scale benchmark dedicated to OmniSTVG. Specifically, BOSTVG consists of 10,018 videos with 10.2M frames and covers a wide selection of 287 classes from diverse scenarios. Each sequence in BOSTVG, paired with a free-form textual query, contains a varying number of targets, ranging from 1 to 10. To ensure high quality, each video is manually annotated with meticulous inspection and refinement. To the best of our knowledge, BOSTVG is to date the first and the largest benchmark for OmniSTVG. To encourage future research, we introduce a simple yet effective approach, named OmniTube, which draws inspiration from Transformer-based STVG methods, is specially designed for OmniSTVG, and demonstrates promising results. By releasing BOSTVG, we hope to go beyond classic STVG by locating every object appearing in the query for more comprehensive understanding, opening up a new direction for STVG. Our benchmark, model, and results will be released at https://github.com/JellyYao3000/OmniSTVG.
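To make the task output concrete, the following is a minimal sketch of how one OmniSTVG sample might be represented: one free-form query paired with several spatio-temporal tubes, each a set of per-frame boxes. All names here (`Tube`, `GroundingSample`, `temporal_iou`), the `(x1, y1, x2, y2)` box convention, and the example query are illustrative assumptions, not the BOSTVG annotation format or the paper's evaluation code. The temporal IoU shown is a common overlap measure for temporal spans and is not claimed to be the paper's metric.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical box convention: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Tube:
    """One target's spatio-temporal tube: a box for each frame it spans."""
    label: str
    boxes: Dict[int, Box] = field(default_factory=dict)  # frame index -> box

    def span(self) -> Tuple[int, int]:
        """Return the (first_frame, last_frame) covered by this tube."""
        frames = sorted(self.boxes)
        return frames[0], frames[-1]


@dataclass
class GroundingSample:
    """A textual query together with the tubes of every target it mentions."""
    query: str
    tubes: List[Tube]


def temporal_iou(a: Tuple[int, int], b: Tuple[int, int]) -> float:
    """IoU of two inclusive frame spans (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


# Made-up example: two targets referred to by a single query.
sample = GroundingSample(
    query="a woman on a boat watches a whale surface",
    tubes=[
        Tube("woman", {f: (10.0, 20.0, 60.0, 120.0) for f in range(10, 20)}),
        Tube("whale", {f: (200.0, 80.0, 320.0, 150.0) for f in range(15, 25)}),
    ],
)

spans = [tube.span() for tube in sample.tubes]
overlap = temporal_iou(spans[0], spans[1])
```

Classic single-target STVG would correspond to the special case where `tubes` holds exactly one entry; the abstract states that BOSTVG sequences contain between 1 and 10 targets.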

Paper Structure

This paper contains 25 sections, 15 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: Illustration and comparison of existing STVG, which localizes a single object in the query (image (a), top), and our OmniSTVG, which locates all objects in the query (image (b), bottom). Each object in the textual query and its corresponding spatio-temporal tube in the video are highlighted using the same color (note that the tubes for "women" and "whales" in (b) are displayed in different colors for better distinction). Best viewed in color and by zooming in for all figures throughout the paper.
  • Figure 3: Wordcloud of all textual queries.
  • Figure 4: Overview of the proposed OmniTube, which consists of a multimodal encoder, a spatio-temporal decoder, and a spatial-temporal box tube generation module to localize all mentioned target objects in the textual query for OmniSTVG.
  • Figure 5: Category organization of our BOSTVG. The inner circle of the pie chart displays 23 coarser object classes, while the outer circle displays 287 fine object categories. Best viewed in pdf and by zooming in.
  • Figure 6: Distribution of textual query length (in characters).
  • ...and 9 more figures