Described Spatial-Temporal Video Detection
Wei Ji, Xiangyan Liu, Yingfei Sun, Jiajun Deng, You Qin, Ammar Nuwanna, Mengyao Qiu, Lina Wei, Roger Zimmermann
TL;DR
This work introduces Described Spatial-Temporal Video Detection (DSTVD), which removes the single-object restriction of spatial-temporal video grounding (STVG): a description may refer to anywhere from none to many objects, and each referred object is localized as a spatio-temporal tubelet. A new dataset, DVD-ST, is released with 5,734 descriptions over 2,750 videos and more than 150 entity types, together with instance-level annotations and evaluation metrics that capture spatial, temporal, and multi-object grounding performance. To tackle DSTVD, the authors adapt two transformer-based STVG models, TubeDETR and STCAT, by adding tubelet queries and a tubelet-wise matcher and by adjusting the training objectives to handle variable object counts and complex queries. Experiments show that the proposed baselines can ground multiple objects across video frames, that performance improves with description length, and that it degrades as the number of entities grows, highlighting both the promise and the challenges of DSTVD for real-world applications.
Abstract
Detecting visual content based on language expressions has become an emerging topic in the community. However, in the video domain, the existing setting, i.e., spatial-temporal video grounding (STVG), is formulated to detect only one pre-existing object in each frame, ignoring the fact that a language description can involve no entity or multiple entities within a video. In this work, we advance STVG to a more practical setting, called described spatial-temporal video detection (DSTVD), that overcomes this limitation. To facilitate the exploration of DSTVD, we first introduce a new benchmark, namely DVD-ST. Notably, DVD-ST supports grounding anywhere from none to many objects in the video in response to a query, and it encompasses a diverse range of over 150 entities, covering appearance, actions, locations, and interactions. The breadth and diversity of the DVD-ST dataset make it a suitable testbed for the investigation of DSTVD. In addition to the new benchmark, we present two baseline methods for the proposed DSTVD task by extending two representative STVG models, i.e., TubeDETR and STCAT. These extended models use tubelet queries to localize and track the referred objects across the video sequence. We also adjust the training objectives of these models to optimize spatial and temporal localization accuracy as well as multi-class classification. Finally, we benchmark the baselines on the DVD-ST dataset and conduct extensive experimental analysis to guide future investigation. Our code and benchmark will be publicly available.
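To make the none-to-many setting concrete, the following is a minimal sketch of how predicted tubelets might be scored against annotated ones. It is an illustration under stated assumptions, not the matcher or the metrics of this work: the tubelet representation (a mapping from frame index to box), the overlap measure (per-frame IoU summed over shared frames and divided by the size of the temporal union, in the style commonly used for STVG evaluation), and the names `box_iou`, `tubelet_iou`, and `match_tubelets` are all hypothetical. A brute-force search stands in for the Hungarian matching usually found in DETR-style models, which is practical only for a handful of tubelets.

```python
from itertools import permutations


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tubelet_iou(pred, gt):
    """Spatio-temporal overlap of two tubelets (assumed definition).

    Each tubelet is a dict {frame_index: box}. Per-frame IoU is summed over
    frames present in both tubelets and divided by the number of frames in
    the temporal union, so temporal misalignment lowers the score.
    """
    frames_union = set(pred) | set(gt)
    if not frames_union:
        return 0.0
    shared = set(pred) & set(gt)
    return sum(box_iou(pred[t], gt[t]) for t in shared) / len(frames_union)


def match_tubelets(preds, gts):
    """One-to-one assignment maximizing total tubelet IoU (brute force).

    Returns a sorted list of (pred_index, gt_index) pairs. An empty list is
    returned when either side is empty, which covers descriptions that
    refer to no object in the video.
    """
    if not preds or not gts:
        return []
    best_score, best_pairs = -1.0, []
    if len(preds) >= len(gts):
        # Choose which prediction is assigned to each ground-truth tubelet.
        candidates = (list(zip(p, range(len(gts))))
                      for p in permutations(range(len(preds)), len(gts)))
    else:
        # Choose which ground-truth tubelet is assigned to each prediction.
        candidates = (list(zip(range(len(preds)), p))
                      for p in permutations(range(len(gts)), len(preds)))
    for pairs in candidates:
        score = sum(tubelet_iou(preds[i], gts[j]) for i, j in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return sorted(best_pairs)
```

In this sketch, unmatched predictions would count as false positives and unmatched annotations as misses; how such cases are actually penalized depends on the evaluation protocol and is left open here.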
