Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning

Xin Gu, Haoji Zhang, Qihang Fan, Jingxuan Niu, Zhipeng Zhang, Libo Zhang, Guang Chen, Fan Chen, Longyin Wen, Sijie Zhu

TL;DR

This work tackles spatio-temporal video grounding (STVG) by leveraging off-the-shelf multimodal large language models (MLLMs) through reinforcement fine-tuning. It introduces a bounding-box chain-of-thought to explicitly reason about object locations over time, coupled with a multi-dimensional reward that aligns training with localization quality. The STVG-o1 framework achieves state-of-the-art results on HCSTVG-v1/v2, matches task-specific models on VidSTG, and demonstrates strong open-vocabulary generalization across datasets, illustrating the practical viability of MLLMs for precise spatio-temporal grounding. Overall, the approach shows that task-oriented reinforcement signals can unlock the grounding potential of general-purpose MLLMs without architectural modifications.

Abstract

Spatio-temporal video grounding (STVG) requires localizing a target object in untrimmed videos both temporally and spatially from natural language descriptions. Despite their strong language understanding, multimodal large language models (MLLMs) underperform on STVG due to misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders. To address this, we propose STVG-o1, the first framework that enables off-the-shelf MLLMs to achieve state-of-the-art STVG performance without any architectural modifications. Our method introduces a bounding-box chain-of-thought mechanism that explicitly reasons about spatio-temporal locations in an intermediate step before producing the final prediction. We further design a multi-dimensional reinforcement reward function consisting of format, consistency, temporal, spatial, and think rewards, which provides geometry-aware supervision through reinforcement fine-tuning. Evaluated on HCSTVG-v1/v2 and VidSTG, STVG-o1 sets new state-of-the-art results on HCSTVG, outperforming the best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matching specialized models on VidSTG, and surpassing all existing MLLM-based approaches by large margins. It also demonstrates strong open-vocabulary generalization across datasets, establishing MLLMs as viable and powerful backbones for precise spatio-temporal grounding. Our code and models will be released.
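To make the "geometry-aware" temporal and spatial rewards more concrete, the following is a minimal Python sketch, not the authors' implementation. It computes a temporal IoU between two time spans and a spatial score that mixes GIoU with an L1 term, as the Figure 2 caption describes. The abstract does not give the exact reward formulas, so the rescaling of GIoU to [0, 1], the `l1_weight` parameter, the normalized `(x1, y1, x2, y2)` box convention, and all function names are assumptions made for illustration.

```python
from typing import Tuple

# Assumed box convention: (x1, y1, x2, y2), normalized to [0, 1].
Box = Tuple[float, float, float, float]


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """IoU of two time spans given as (start, end) in the same unit."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def _area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def giou(a: Box, b: Box) -> float:
    """Generalized IoU in [-1, 1]; remains informative for disjoint boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = _area(a) + _area(b) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    if union <= 0 or hull <= 0:
        return 0.0  # degenerate boxes: fallback chosen for this sketch only
    return inter / union - (hull - union) / hull


def spatial_reward(pred: Box, gt: Box, l1_weight: float = 0.5) -> float:
    """Hypothetical GIoU + L1 mix in [0, 1]; the weighting is an assumption."""
    giou_term = (giou(pred, gt) + 1.0) / 2.0  # rescale [-1, 1] -> [0, 1]
    l1_term = 1.0 - sum(abs(p - g) for p, g in zip(pred, gt)) / 4.0
    return (1.0 - l1_weight) * giou_term + l1_weight * l1_term
```

For example, `temporal_iou((0.0, 10.0), (5.0, 15.0))` compares a predicted span against a ground-truth span that overlaps it by half, and `spatial_reward(pred_box, gt_box)` scores a single frame. A per-video spatial reward could then average the per-frame scores, though how the paper aggregates them is not stated here.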

Paper Structure

This paper contains 14 sections, 12 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Comparison between existing MLLM-based STVG methods in (a) and the proposed STVG-o1 method in (b). Best viewed in color for all figures.
  • Figure 2: Overview of STVG-o1. Given a video and a natural language query, the base MLLM generates a chain-of-thought output sequence $O_1, \dots, O_G$, where each step contains a temporal span, a sequence of thinking bounding boxes, and a sequence of final prediction bounding boxes. A reward model computes multi-dimensional rewards: format reward $\mathcal{R}_f$, consistency reward $\mathcal{R}_c$, temporal reward $\mathcal{R}_t$, spatial reward $\mathcal{R}_s$ (combining GIoU and L1), and a think reward $\mathcal{R}_k$ that encourages refinement based on intermediate predictions. These rewards are aggregated to form a composite reward for reinforcement fine-tuning, enabling accurate spatio-temporal video grounding without architectural modifications. Best viewed in color for all figures.
  • Figure 3: Qualitative results of our STVG-o1. Green boxes denote ground truth, blue boxes represent intermediate <think_bbox> predictions during reasoning, and red boxes indicate final <pred_bbox> outputs. Best viewed in color for all figures.
  • Figure 4: Analysis of average vIoU across different bounding box areas. Blue curve shows performance of intermediate <think_bbox> predictions, red curve shows final <pred_bbox> outputs, and green bars indicate the relative improvement ($\Delta$ vIoU) from think bbox to predicted bbox.
  • Figure 5: Qualitative results of our STVG-o1. Green boxes denote ground truth, blue boxes represent intermediate <think_bbox> predictions during reasoning, and red boxes indicate final <pred_bbox> outputs. Best viewed in color for all figures.
  • ...and 3 more figures
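Figure 4 reports vIoU, a metric commonly used in the STVG literature. As a rough illustration, the sketch below follows the usual definition: per-frame box IoU is summed over frames where both prediction and ground truth exist, then divided by the number of frames in the union of the two temporal extents. The dictionary-based tube representation and the function names are assumptions for this example, and the paper's exact evaluation code may differ.

```python
from typing import Dict, Tuple

# Assumed box convention: (x1, y1, x2, y2) in any consistent unit.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Plain intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def viou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """vIoU over tubes stored as {frame_index: box} (hypothetical layout).

    Frames present in only one tube contribute zero to the sum but still
    count in the denominator, so temporal errors lower the score.
    """
    union_frames = set(pred) | set(gt)
    if not union_frames:
        return 0.0
    shared_frames = set(pred) & set(gt)
    total = sum(box_iou(pred[t], gt[t]) for t in shared_frames)
    return total / len(union_frames)
```

Because the denominator counts every frame in either tube, a prediction with perfect boxes but a shifted temporal span is still penalized, which is why vIoU reflects both temporal and spatial quality.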