STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning

Xiaowen Zhang; Zhi Gao; Licheng Jiao; Lingling Li; Qing Li

TL;DR

This work tackles hallucinations and misalignment in dense video grounding by introducing object-centric visual prompts that assign temporally consistent IDs to objects, reframing per-frame coordinate prediction as an instance-level identification task. It then couples this prompting scheme with a reinforcement learning framework (GRPO-based) that optimizes temporal precision, spatial consistency, and reasoning format, achieving state-of-the-art results across six benchmarks and strong zero-shot transfer to multi-object video segmentation. The approach relies on an automatic object-detection and tracking pipeline to generate prompt-guided video representations and demonstrates that explicit instance identifiers improve coherent reasoning and grounding across time and space. Empirically, STVG-R1 delivers substantial gains over strong baselines with minimal visual-prompt interference, indicating a scalable and annotation-efficient path for robust STVG and related tasks.
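The reframing described above can be illustrated with a small sketch: once a tracker has produced per-object tracks, a model answer consisting of an object ID and a time span is enough to recover a full spatio-temporal tube by lookup. The data layout (`tracks` as a nested dictionary keyed by object ID and frame index) and the function name `answer_to_tube` are assumptions made for illustration, not the paper's actual interface.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def answer_to_tube(
    tracks: Dict[int, Dict[int, Box]],
    object_id: int,
    start_frame: int,
    end_frame: int,
) -> Dict[int, Box]:
    """Turn an (ID, time span) answer into per-frame boxes.

    `tracks` maps object ID -> {frame index -> box}; it stands in for the
    output of a detection-and-tracking pipeline (assumed format).
    """
    track = tracks.get(object_id, {})  # unknown IDs yield an empty tube
    return {
        frame: box
        for frame, box in sorted(track.items())
        if start_frame <= frame <= end_frame
    }


# Toy tracker output: two objects over three frames.
tracks = {
    1: {0: (0, 0, 10, 10), 1: (1, 1, 11, 11), 2: (2, 2, 12, 12)},
    2: {0: (50, 50, 60, 60), 1: (51, 51, 61, 61)},
}
tube = answer_to_tube(tracks, object_id=1, start_frame=1, end_frame=2)
```

Under this sketch the model never emits coordinates itself; it only selects an ID and a span, and the boxes come from the tracker.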

Abstract

In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Prior approaches typically focus on enhancing visual-textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per-frame coordinate prediction as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable inputs to the VLMs. Furthermore, we introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization. Extensive experiments on six benchmarks demonstrate the effectiveness of our approach. STVG-R1 surpasses the baseline Qwen2.5-VL-7B by a remarkable margin of 20.9% on m_IoU on the HCSTVG-v2 benchmark, establishing a new state of the art (SOTA). Surprisingly, STVG-R1 also exhibits strong zero-shot generalization to multi-object referring video object segmentation tasks, achieving a SOTA 47.3% J&F on MeViS.
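As a rough illustration of a task-driven reward with temporal, spatial, and format terms, the sketch below sums three components: temporal IoU between predicted and ground-truth spans, an ID-match term, and a response-template check. The abstract does not give the exact definitions, so the equal weighting, the exact-match spatial term, and the `<think>`/`<answer>` tag template are all assumptions.

```python
import re
from typing import Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed think/answer template."""
    pattern = r"^<think>.*</think>\s*<answer>.*</answer>$"
    return 1.0 if re.match(pattern, response.strip(), re.DOTALL) else 0.0


def total_reward(pred_span: Span, gt_span: Span,
                 pred_id: int, gt_id: int, response: str) -> float:
    """Equal-weight sum of the three terms (the weighting is an assumption)."""
    spatial = 1.0 if pred_id == gt_id else 0.0  # assumed exact ID match
    return temporal_iou(pred_span, gt_span) + spatial + format_reward(response)


good = total_reward((0.0, 10.0), (0.0, 10.0), 3, 3,
                    "<think>reasoning</think><answer>ID 3, 0-10s</answer>")
bad = total_reward((0.0, 5.0), (5.0, 10.0), 1, 2, "no tags here")
```

In a GRPO-style setup, a scalar of this kind would be computed for each sampled response and compared within the group of samples for the same prompt.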

Paper Structure (34 sections, 8 equations, 16 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Spatial Temporal Video Grounding
Reinforcement Learning in VLMs
Method
STVG-R1 Framework
Object-Centric Prompted Video Construction
Enhancing VLMs with Reinforcement Learning
Training Strategies
Experiments
Setting
Evaluation Results on Spatial Temporal Video Grounding
Evaluation Results on Video Spatial Grounding
Evaluation Results on Video Temporal Grounding
Ablation
...and 19 more sections

Figures (16)

Figure 1: Comparisons of general VLMs, specialized VLMs, and proposed STVG-R1 model. While Qwen2.5-VL-7B outputs a single meaningless bounding box without timestamps, LLaVA-ST is restricted to one bounding box per frame. In contrast, STVG-R1 achieves strong performance on both spatial–temporal video grounding and zero-shot multi-object referring video object segmentation.
Figure 2: Comparison of paradigms: (a) VLM produces both timestamps and frame-level coordinates with a trainable alignment block; (b) VLM generates segmentation tokens, which are then processed by a trainable decoder; (c) our method uses training-free object-centric visual prompted video for spatial-temporal video grounding.
Figure 3: An illustration of our proposed STVG-R1 framework. Each object is assigned a unique ID via visual prompts, and the policy model is trained with spatial, temporal, and template rewards.
Figure 4: Case study of STVG-R1 on the spatial-temporal video grounding task.
Figure 5: Prompt for spatial-temporal video grounding.
...and 11 more figures
