Described Spatial-Temporal Video Detection

Wei Ji, Xiangyan Liu, Yingfei Sun, Jiajun Deng, You Qin, Ammar Nuwanna, Mengyao Qiu, Lina Wei, Roger Zimmermann

TL;DR

The work introduces Described Spatial-Temporal Video Detection (DSTVD) to overcome STVG's limitation of grounding a single object by enabling descriptions to refer to none-to-many objects and providing tubelet-level spatio-temporal localization. A new DVD-ST dataset is released with 5734 descriptions over 2750 videos and over 150 entity types, accompanied by rigorous instance-level annotations and evaluation metrics that capture spatial, temporal, and multi-object grounding performance. To tackle DSTVD, the authors adapt two transformer-based STVG models, TubeDETR and STCAT, by adding tubelet queries and a tubelet-wise matcher, along with training objective adjustments to handle variable object counts and complex queries. Experimental results show the proposed baselines can ground multiple objects across video frames, with performance positively influenced by description length and negatively affected by high entity counts, highlighting both the promise and the challenges of described spatial-temporal video detection for real-world applications.

Abstract

Detecting visual content based on language expressions has become an emerging topic in the community. However, in the video domain, the existing setting, i.e., spatial-temporal video grounding (STVG), is formulated to detect only one pre-existing object in each frame, ignoring the fact that language descriptions can involve no entities or multiple entities within a video. In this work, we advance STVG to a more practical setting called described spatial-temporal video detection (DSTVD) by overcoming this limitation. To facilitate the exploration of DSTVD, we first introduce a new benchmark, namely DVD-ST. Notably, DVD-ST supports grounding from none to many objects onto the video in response to queries and encompasses a diverse range of over 150 entities, including appearance, actions, locations, and interactions. The breadth and diversity of the DVD-ST dataset make it an exemplary testbed for the investigation of DSTVD. In addition to the new benchmark, we present two baseline methods for the proposed DSTVD task by extending two representative STVG models, i.e., TubeDETR and STCAT. These extended models use tubelet queries to localize and track referred objects across the video sequence. We also adjust the training objectives of these models to optimize spatial and temporal localization accuracy and multi-class classification capabilities. Furthermore, we benchmark the baselines on the introduced DVD-ST dataset and conduct extensive experimental analysis to guide future investigation. Our code and benchmark will be publicly available.

Paper Structure (38 sections, 3 equations, 6 figures, 5 tables)

This paper contains 38 sections, 3 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparison between the VidSTG dataset and our DVD-ST in terms of generalizability of descriptions and number of referred objects. VidSTG is one of the representative STVG datasets, while our DVD-ST aims to benchmark a more practical described spatial-temporal video detection setting.
  • Figure 2: Examples of the described queries from DVD-ST, which are semantically rich in entities.
  • Figure 3: Word cloud of the described queries from DVD-ST, which includes sufficient object and relation entities.
  • Figure 4: Overview of the annotation platform and dataset statistics: (a) shows the interface of the annotation platform, (b) illustrates the distribution of objects, and (c) presents the most frequent objects in the dataset.
  • Figure 5: Illustration of our proposed TubeDETR-M framework, which is a simple yet effective baseline for the DSTVD task. All input video frames and the description are first processed with a Visual Encoder and a Text Encoder. The resulting video features $h_v$ and text features $h_q$ are then jointly encoded with a Video-Text Encoder that computes spatial and multi-modal interactions. The resulting video-text features are then decoded into the output spatio-temporal tube using a Transformer Decoder, which is guided by tubelet queries. Our adaptations for DSTVD primarily focus on 1) improvements to the decoder input side, and 2) the introduction of a tubelet-wise matcher. These enhancements also apply to our other framework, STCAT-M.
  • ...and 1 more figure
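
As a rough illustration of the tubelet-wise matching idea mentioned in the Figure 5 caption, the sketch below pairs predicted tubelets with ground-truth tubelets one-to-one. It is not the authors' matcher: the names `box_iou`, `tubelet_iou`, and `match_tubelets` are hypothetical, the tubelet IoU (per-frame box IoU summed over shared frames, divided by the number of frames covered by either tubelet) is an assumed definition, and the brute-force search stands in for the Hungarian-style matching with a learned cost that DETR-like models typically use.

```python
from itertools import permutations


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tubelet_iou(pred, gt):
    """Assumed tubelet IoU: each tubelet is a dict {frame_index: box}.

    Per-frame IoU is summed over frames present in both tubelets and
    divided by the number of frames present in either one.
    """
    all_frames = set(pred) | set(gt)
    if not all_frames:
        return 0.0
    shared = set(pred) & set(gt)
    return sum(box_iou(pred[t], gt[t]) for t in shared) / len(all_frames)


def match_tubelets(preds, gts):
    """One-to-one assignment maximizing total tubelet IoU (brute force).

    Returns a list of (pred_index, gt_index) pairs. Only suitable for a
    handful of tubelets; an empty ground-truth list yields no pairs,
    which corresponds to a description that refers to no object.
    """
    n, m = len(preds), len(gts)
    best_score, best_pairs = -1.0, []
    if n >= m:
        # Choose an ordered subset of predictions, one per ground truth.
        candidates = ([(p[j], j) for j in range(m)]
                      for p in permutations(range(n), m))
    else:
        # Fewer predictions than ground truths: assign each prediction.
        candidates = ([(i, p[i]) for i in range(n)]
                      for p in permutations(range(m), n))
    for pairs in candidates:
        score = sum(tubelet_iou(preds[i], gts[j]) for i, j in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_pairs


# Toy example: two ground-truth tubelets, three predictions.
gt_a = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10)}
gt_b = {0: (20, 20, 30, 30), 1: (20, 20, 30, 30)}
preds = [
    {0: (21, 21, 30, 30), 1: (20, 20, 30, 30)},  # close to gt_b
    {0: (0, 0, 10, 10), 1: (0, 0, 10, 10)},      # identical to gt_a
    {5: (50, 50, 60, 60)},                        # overlaps nothing
]
matches = match_tubelets(preds, [gt_a, gt_b])
```

In a training setup along these lines, predictions left unmatched (here the third one) would usually be supervised toward a "no object" label, which is one way a model can handle descriptions that refer to a variable number of objects.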