SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability

Jiankang Wang, Zhihan Zhang, Zhihang Liu, Yang Li, Jiannan Ge, Hongtao Xie, Yongdong Zhang

TL;DR

SpaceVLLM introduces Spatio-Temporal Aware Queries and a Query-Guided Space Decoder to give a multimodal LLM joint spatio-temporal video grounding capability. It also introduces Uni-STG, a 480K-sample Unified Spatio-Temporal Grounding dataset covering three tasks: video temporal grounding (VTG), referring expression comprehension (REC), and spatio-temporal video grounding (STVG). The authors report state-of-the-art performance across 11 benchmarks spanning STVG, VTG, REC and video understanding tasks. They plan to release the code, datasets and model.

Abstract

Multimodal large language models (MLLMs) have made remarkable progress in either temporal or spatial localization. However, they struggle to perform spatio-temporal video grounding. This limitation stems from two major challenges. Firstly, it is difficult to extract accurate spatio-temporal information for each frame in the video. Secondly, the substantial number of visual tokens makes it challenging to precisely map the visual tokens of each frame to their corresponding spatial coordinates. To address these issues, we introduce SpaceVLLM, an MLLM endowed with spatio-temporal video grounding capability. Specifically, we adopt a set of interleaved Spatio-Temporal Aware Queries to capture temporal perception and dynamic spatial information. Moreover, we propose a Query-Guided Space Decoder to establish a correspondence between the queries and spatial coordinates. Additionally, due to the lack of spatio-temporal datasets, we construct the Unified Spatio-Temporal Grounding (Uni-STG) dataset, comprising 480K instances across three tasks. This dataset fully exploits the potential of MLLMs to simultaneously facilitate localization in both temporal and spatial dimensions. Extensive experiments demonstrate that SpaceVLLM achieves state-of-the-art performance across 11 benchmarks covering temporal, spatial, spatio-temporal and video understanding tasks, highlighting the effectiveness of our approach. Our code, datasets and model will be released at https://github.com/Jayce1kk/SpaceVLLM.
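
To make the idea of interleaved Spatio-Temporal Aware Queries more concrete, here is a minimal NumPy sketch of how one query token per frame could be interleaved with that frame's visual tokens. It is an illustration only, not the authors' implementation: the function name `interleave_queries`, the tensor shapes, and the choice to place each query directly after its frame's visual tokens are all assumptions made for this example.

```python
import numpy as np

def interleave_queries(frame_tokens, query_tokens):
    """Interleave one query token per frame with that frame's visual tokens.

    frame_tokens: array of shape (T, N, D), N visual tokens for each of T frames.
    query_tokens: array of shape (T, D), one query embedding per frame (assumed).
    Returns the flattened sequence of shape (T * (N + 1), D) and the index of
    each frame's query token inside that sequence.
    """
    T, N, D = frame_tokens.shape
    assert query_tokens.shape == (T, D)
    # Assumption: the query for frame t is placed right after frame t's tokens.
    per_frame = np.concatenate([frame_tokens, query_tokens[:, None, :]], axis=1)
    sequence = per_frame.reshape(T * (N + 1), D)
    # Query t sits at the last slot of the t-th block of N + 1 tokens.
    query_positions = np.arange(T) * (N + 1) + N
    return sequence, query_positions

# Toy usage: 3 frames, 4 visual tokens per frame, embedding size 2.
frames = np.zeros((3, 4, 2))
queries = np.ones((3, 2))
seq, pos = interleave_queries(frames, queries)
```

Keeping the query positions makes it possible to read back one embedding per frame from the language model's output sequence, which is the per-frame signal a downstream coordinate decoder would need.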

Paper Structure

This paper contains 25 sections, 6 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Examples of the Video Temporal Grounding (VTG), Referring Expression Comprehension (REC) and Spatio-Temporal Video Grounding (STVG) tasks addressed by SpaceVLLM.
  • Figure 2: The overall architecture of SpaceVLLM. A set of ordered Spatio-Temporal Aware Queries is interleaved with the visual tokens of each video frame to capture spatio-temporal information. The LLM's last-layer query embeddings, combined with the corresponding visual and description embeddings, are fed into the Query-Guided Space Decoder to predict frame-wise coordinates.
  • Figure 3: Data synthesis pipeline for the STVG task.
  • Figure 4: Data characteristics of Uni-STG for the STVG task.
  • Figure 5: Visual comparison of LLM-based models on the Spatio-Temporal Video Grounding task. Box colors: green is the ground truth, purple is Qwen 2.5 VL, yellow is GroundingGPT, and red is SpaceVLLM.
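
The Figure 2 caption says that per-frame query embeddings, together with the corresponding visual and description embeddings, are decoded into frame-wise coordinates. The sketch below shows one plausible way such a decoder could be wired, purely as a toy illustration. Everything in it is an assumption made for this example rather than the paper's design: single-head dot-product attention from each query over its own frame's visual tokens, additive fusion with a pooled description embedding, a single linear output layer, and sigmoid-normalized box coordinates. The names `decode_boxes`, `w_out` and `b_out` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_boxes(query_emb, visual_emb, text_emb, w_out, b_out):
    """Toy query-guided decoder (hypothetical, not the authors' module).

    query_emb:  (T, D)    one query embedding per frame
    visual_emb: (T, N, D) visual token embeddings per frame
    text_emb:   (D,)      pooled description embedding
    w_out:      (D, 4), b_out: (4,) linear head parameters
    Returns (T, 4) box coordinates, each squashed into [0, 1].
    """
    T, N, D = visual_emb.shape
    # Condition each frame's query on the description (assumed additive fusion).
    q = query_emb + text_emb
    # Each query attends only over the visual tokens of its own frame.
    scores = np.einsum("td,tnd->tn", q, visual_emb) / np.sqrt(D)
    attn = softmax(scores, axis=-1)
    context = np.einsum("tn,tnd->td", attn, visual_emb)
    # Residual fusion followed by a linear head and a sigmoid.
    logits = (q + context) @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-logits))

# Toy usage with random inputs: 3 frames, 5 visual tokens, embedding size 8.
rng = np.random.default_rng(0)
boxes = decode_boxes(
    rng.normal(size=(3, 8)),
    rng.normal(size=(3, 5, 8)),
    rng.normal(size=8),
    rng.normal(size=(8, 4)),
    np.zeros(4),
)
```

The point of the sketch is the data flow: one box per frame is produced from that frame's query, so the number of predicted boxes follows the number of queries rather than the much larger number of visual tokens.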