STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning

Xiaowen Zhang, Zhi Gao, Licheng Jiao, Lingling Li, Qing Li

TL;DR

This work tackles hallucinations and misalignment in dense video grounding by introducing object-centric visual prompts that assign temporally consistent IDs to objects, reframing per-frame coordinate prediction as an instance-level identification task. It then couples this prompting scheme with a reinforcement learning framework (GRPO-based) that optimizes temporal precision, spatial consistency, and reasoning format, achieving state-of-the-art results across six benchmarks and strong zero-shot transfer to multi-object video segmentation. The approach relies on an automatic object-detection and tracking pipeline to generate prompt-guided video representations and demonstrates that explicit instance identifiers improve coherent reasoning and grounding across time and space. Empirically, STVG-R1 delivers substantial gains over strong baselines with minimal visual-prompt interference, indicating a scalable and annotation-efficient path for robust STVG and related tasks.
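
To make the reframing concrete, here is a minimal sketch of how an instance-level answer (an object ID plus a time span) could be expanded back into per-frame boxes using the tracker output. The `Tracks` layout, the `recover_tube` helper, the box format, and the frame-rate handling are all assumptions made for illustration; the summary above does not specify them.

```python
from typing import Dict, Tuple

# Hypothetical data layout (not taken from the paper):
# a box is (x1, y1, x2, y2); tracks map object ID -> {frame index -> box}.
Box = Tuple[float, float, float, float]
Tracks = Dict[int, Dict[int, Box]]


def recover_tube(tracks: Tracks, object_id: int,
                 start_s: float, end_s: float, fps: float) -> Dict[int, Box]:
    """Expand an (ID, time span) answer into per-frame boxes.

    The model only names an ID and a span; the coordinates come from
    the tracker output, so no coordinate is predicted by the model.
    """
    if object_id not in tracks or end_s < start_s:
        return {}  # unknown ID or inverted span: nothing to ground
    first = int(round(start_s * fps))
    last = int(round(end_s * fps))
    return {frame: box
            for frame, box in sorted(tracks[object_id].items())
            if first <= frame <= last}


# Toy tracker output: object 1 drifts right, object 2 stays put.
tracks: Tracks = {
    1: {0: (0, 0, 10, 10), 1: (1, 0, 11, 10), 2: (2, 0, 12, 10), 3: (3, 0, 13, 10)},
    2: {0: (50, 50, 60, 60), 1: (50, 50, 60, 60)},
}
tube = recover_tube(tracks, object_id=1, start_s=1.0, end_s=2.0, fps=1.0)
```

In this sketch the output space shrinks from one box per frame to a single ID and two timestamps, which is the compactness the TL;DR refers to.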

Abstract

In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Prior approaches typically focus on enhancing visual-textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per-frame coordinate prediction as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable inputs to the VLMs. Furthermore, we introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization. Extensive experiments on six benchmarks demonstrate the effectiveness of our approach. STVG-R1 surpasses the baseline Qwen2.5-VL-7B by a margin of 20.9% in m_IoU on the HCSTVG-v2 benchmark, establishing a new state of the art (SOTA). Surprisingly, STVG-R1 also exhibits strong zero-shot generalization to multi-object referring video object segmentation tasks, achieving a SOTA 47.3% J&F on MeViS.
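
The three reward components named in the abstract (temporal accuracy, spatial consistency, format) can be illustrated with a small sketch. Everything specific here is an assumption, not the paper's definition: the answer template, the use of temporal IoU for the temporal term, an exact ID match for the spatial term, and the unweighted sum.

```python
import re
from typing import Optional, Tuple

# Hypothetical output template (an assumption for this sketch):
#   <think>...</think><answer>ID: 3, 2.0-6.0</answer>
NUM = r"\d+(?:\.\d+)?"
ANSWER_RE = re.compile(
    rf"<think>.*?</think>\s*<answer>\s*ID:\s*(\d+)\s*,\s*({NUM})\s*-\s*({NUM})\s*</answer>",
    re.DOTALL,
)


def parse_answer(text: str) -> Optional[Tuple[int, float, float]]:
    """Return (object_id, start_s, end_s), or None if the template is violated."""
    m = ANSWER_RE.fullmatch(text.strip())
    if m is None:
        return None
    return int(m.group(1)), float(m.group(2)), float(m.group(3))


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection over union of two time spans given in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    if pe <= ps or ge <= gs:
        return 0.0  # empty or inverted span
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def reward(text: str, gt_id: int, gt_span: Tuple[float, float]) -> float:
    """Illustrative reward: format + temporal IoU + ID match (unweighted)."""
    parsed = parse_answer(text)
    if parsed is None:
        return 0.0  # a malformed output earns nothing in this sketch
    pred_id, start_s, end_s = parsed
    format_r = 1.0
    temporal_r = temporal_iou((start_s, end_s), gt_span)
    spatial_r = 1.0 if pred_id == gt_id else 0.0
    return format_r + temporal_r + spatial_r


good = "<think>The man in red walks to the door.</think><answer>ID: 3, 2.0-6.0</answer>"
score = reward(good, gt_id=3, gt_span=(4.0, 8.0))
```

Because the spatial term reduces to comparing two IDs, the reward can be computed without any box regression at training time; whether the paper uses exactly this form is not stated in the abstract.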

Paper Structure (34 sections, 8 equations, 16 figures, 11 tables, 1 algorithm)

This paper contains 34 sections, 8 equations, 16 figures, 11 tables, 1 algorithm.

Figures (16)

  • Figure 1: Comparisons of general VLMs, specialized VLMs, and proposed STVG-R1 model. While Qwen2.5-VL-7B outputs a single meaningless bounding box without timestamps, LLaVA-ST is restricted to one bounding box per frame. In contrast, STVG-R1 achieves strong performance on both spatial–temporal video grounding and zero-shot multi-object referring video object segmentation.
  • Figure 2: Comparison of paradigms: (a) VLM produces both timestamps and frame-level coordinates with a trainable alignment block; (b) VLM generates segmentation tokens, which are then processed by a trainable decoder; (c) our method uses training-free object-centric visual prompted video for spatial-temporal video grounding.
  • Figure 3: An illustration of our proposed STVG-R1 framework. Each object is assigned a unique ID via visual prompts, and the policy model is trained with spatial, temporal, and template rewards.
  • Figure 4: Case study of STVG-R1 on the spatial-temporal video grounding task.
  • Figure 5: Prompt for spatial-temporal video grounding.
  • ...and 11 more figures