1 + 1 > 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning

Shida Gao, Feng Xue, Xiangfeng Wang, Anlong Ming, Teng Long, Yihua Shao, Haozhe Wang, Zhaowen Lin, Wei Wang, Nicu Sebe

TL;DR

DEViL tackles spatio-temporal grounding and reasoning by coupling a multimodal large language model with an open-vocabulary detector via a Reference-Semantic Token (RST). It introduces tube-mined temporal regularization (TTReg) to enforce cross-frame consistency and avoids the error-prone autoregressive, text-based coordinate decoding. A three-stage curriculum bridges the MLLM and detector and enables unified spatio-temporal grounding across STVG, TVG, and grounded VQA. Empirical results show strong spatio-temporal grounding and reasoning across multiple benchmarks, with robustness and improved efficiency on long videos.

Abstract

Spatio-temporal grounding and reasoning aims to locate the temporal segment and spatial region of an event in a video given a user query, while also reasoning about semantics such as causality, temporal order, and action relationships. To achieve this, current MLLMs primarily treat bounding boxes as text tokens and generate them autoregressively. However, such autoregressive spatial decoding leads to very long output sequences, causing spatial errors to accumulate over time and the localization results to drift progressively across a video. To address this, we present a Detector-Empowered Video LLM, DEViL for short, which couples a Video LLM with an open-vocabulary detector (OVD). Specifically, the MLLM and detector are connected via a reference-semantic token (RST) that distills the user query into a rich semantic representation. Unlike tokens that merely serve as spatial prompts or segmentor switches, the RST functions as both a control signal and a replacement for the OVD's text embedding, enabling end-to-end learning of both referential understanding and spatial localization. Furthermore, we propose a tube-mined temporal regularization (TTReg) within the OVD, which drives the OVD to generate temporally consistent queries for target objects, thereby ensuring effective temporal association. Experiments demonstrate that DEViL achieves strong performance across various fine-grained video understanding tasks, particularly STVG and GroundedVQA. Code will be released at https://github.com/gaostar123/DeViL.
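The RST hand-off described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token id `BOX_TOKEN_ID`, the function `extract_rst`, the feature sizes, and the linear projection `proj` are all hypothetical, and random arrays stand in for real MLLM hidden states. It only shows the idea of reading the hidden state at the special token and mapping it to the size a detector's text embedding would have.

```python
import numpy as np

BOX_TOKEN_ID = 32000  # hypothetical id for the special [BOX] token


def extract_rst(token_ids, hidden_states, proj):
    """Return the projected hidden state at the first [BOX] position.

    token_ids:     (seq_len,) token ids emitted by the language model
    hidden_states: (seq_len, d_llm) last-layer hidden states
    proj:          (d_llm, d_det) assumed linear map to the detector's
                   text-embedding size (not confirmed by the source)
    """
    token_ids = np.asarray(token_ids)
    positions = np.flatnonzero(token_ids == BOX_TOKEN_ID)
    if positions.size == 0:
        raise ValueError("no [BOX] token in the sequence")
    rst = hidden_states[positions[0]]  # shape (d_llm,)
    return rst @ proj                  # shape (d_det,)


# Toy inputs: random values stand in for real model activations.
rng = np.random.default_rng(0)
token_ids = [1, 57, 902, BOX_TOKEN_ID, 2]
hidden_states = rng.normal(size=(5, 16))
proj = rng.normal(size=(16, 8))

rst = extract_rst(token_ids, hidden_states, proj)
```

In this sketch the resulting vector `rst` would be passed to the detector in place of its usual text embedding, so a single token carries the referent regardless of how many frames are processed.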

Paper Structure

This paper contains 18 sections, 10 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Comparing LLaVA-ST [llavast] and DEViL. (a) The mean IoU between predicted and ground-truth boxes (miOP) across the ground-truth interval. (b) Structure of LLaVA-ST. (c) Structure of our method, DEViL. For each video on VidSTG [zhang2020does], we evenly split the ground-truth segment into 5 parts and compute miOP on each part, producing 5 points along the $\texttt{x}$-axis (1/5 to 5/5). Long autoregressive output sequences suffer from error accumulation and localization drift.
  • Figure 2: Overall architecture of DEViL. Given a video and query, the MLLM encodes them and emits a special [BOX] token whose hidden state serves as the Reference-Semantic Token (RST). The RST replaces the text embedding of the open-vocabulary detector (OVD) to drive object queries. A memory-based tube association maintains query identity across frames, while tube-mined temporal regularization (TTReg) regularizes ground-truth–aligned tubes to learn temporally consistent boxes. Note that the classification head of the OVD is omitted to simplify the illustration.
  • Figure 3: Attention and detection comparison between the [BOX]-induced RST/text feature and image features (red: w/ TTReg; green: w/o). TTReg keeps attention and boxes on the target, while removing it causes scattered attention and jitter. Grounding DINO (yellow boxes) instead uses text–image attention that focuses on a distractor.
  • Figure 4: Qualitative comparison between LLaVA-ST and DEViL. For each example, the first row (green) shows LLaVA-ST’s predictions, while the second row (red) shows those of DEViL.
  • Figure 5: Auto-labeling process used to convert temporal video grounding datasets into spatio-temporal video grounding datasets.
  • ...and 6 more figures
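
The per-part evaluation described in the Figure 1 caption (split the ground-truth segment into 5 equal parts and average the IoU within each) can be sketched as follows. This is a minimal illustration rather than the paper's evaluation code: the helper names `box_iou` and `iou_per_part`, the `(x1, y1, x2, y2)` box format, and the assumption that every ground-truth frame has exactly one predicted box are all choices made for this example.

```python
import numpy as np


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_per_part(pred_boxes, gt_boxes, parts=5):
    """Mean IoU over each of `parts` consecutive chunks of the GT segment."""
    if len(pred_boxes) != len(gt_boxes):
        raise ValueError("need one predicted box per ground-truth frame")
    if len(gt_boxes) < parts:
        raise ValueError("segment is shorter than the number of parts")
    ious = np.array([box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    # array_split keeps frame order and tolerates lengths not divisible by parts
    return [float(chunk.mean()) for chunk in np.array_split(ious, parts)]


# Toy example: the target stays still while the prediction drifts right
# by one pixel per frame, mimicking the drift the caption describes.
gt = [(0.0, 0.0, 10.0, 10.0)] * 10
pred = [(float(k), 0.0, 10.0 + k, 10.0) for k in range(10)]
curve = iou_per_part(pred, gt)
```

On this toy input the five values in `curve` decrease from the first part to the last, which is the shape of curve the caption attributes to error accumulation in long output sequences.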