VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting

Daeun Lee, Shoubin Yu, Yue Zhang, Mohit Bansal

Abstract

Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve reliable spatio-temporal grounding during the reasoning process. Moreover, improving grounding typically relies on scaled training data or inference-time perception tools, which increase annotation or computational cost. To address this challenge, we propose VisionCoach, an input-adaptive RL framework that improves spatio-temporal grounding by using visual prompting as training-time guidance. During RL training, visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors. The model then internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos without visual prompting at inference. VisionCoach consists of two components: (1) a Visual Prompt Selector, which predicts appropriate prompt types conditioned on the video and question, and (2) a Spatio-Temporal Reasoner, optimized with RL under visual prompt guidance and object-aware grounding rewards that enforce object identity consistency and multi-region bounding-box overlap. Extensive experiments demonstrate that VisionCoach achieves state-of-the-art performance under comparable settings across diverse video reasoning, video understanding, and temporal grounding benchmarks (V-STAR, VideoMME, World-Sense, VideoMMMU, PerceptionTest, and Charades-STA), while maintaining a single efficient inference pathway without external tools. Our results show that visual prompting during training improves grounded video reasoning, while self-distillation enables the model to internalize this ability without requiring prompts at inference time.
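
The abstract describes object-aware grounding rewards that combine object identity consistency with multi-region bounding-box overlap, but it does not give the formula. The following is a minimal sketch of one plausible way such a reward could be computed, not the authors' definition. The function names `iou` and `spatial_grounding_reward`, the label-keyed dictionary layout, and the choice of averaging best-match IoU per object are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_grounding_reward(
    pred: Dict[str, List[Box]], gt: Dict[str, List[Box]]
) -> float:
    """Hypothetical reward in [0, 1].

    Identity consistency: a predicted box only counts toward a ground-truth
    object if it carries the same object label.
    Multi-region overlap: each ground-truth region is scored by the best IoU
    among the predicted boxes with that label, then scores are averaged.
    """
    if not gt:
        return 0.0
    total = 0.0
    for label, gt_boxes in gt.items():
        pred_boxes = pred.get(label, [])
        if not pred_boxes or not gt_boxes:
            continue  # missing or mislabeled object contributes 0
        per_region = [max(iou(p, g) for p in pred_boxes) for g in gt_boxes]
        total += sum(per_region) / len(gt_boxes)
    return total / len(gt)


gt = {"dog": [(0.0, 0.0, 10.0, 10.0)]}
print(spatial_grounding_reward({"dog": [(0.0, 0.0, 10.0, 5.0)]}, gt))
print(spatial_grounding_reward({"cat": [(0.0, 0.0, 10.0, 10.0)]}, gt))
```

In this sketch a correctly located box with the wrong object label earns no reward, which is one simple way to penalize identity drift while still crediting partial overlap.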

Paper Structure (30 sections, 17 equations, 11 figures, 13 tables, 1 algorithm)

Figures (11)

  • Figure 1: Comparison with previous video reasoning methods and VisionCoach. VisionCoach leverages visual-prompt-guided RL with self-distillation to internalize improved spatio-temporal grounding behaviors induced by visual guidance. During inference, it maintains single forward-pass reasoning while achieving enhanced grounding performance.
  • Figure 2: Detailed architecture of VisionCoach. We introduce VisionCoach, a visual-prompt-guided RL framework for training a spatio-temporally grounded reasoner (\ref{sec:method-pipeline}). The framework includes a Visual Prompt Selector (VP-Selector) that predicts optimal visual prompts (\ref{sec:method-vpselector}), and object-aware spatial grounding rewards that enforce object identity consistency and overlap across multiple predicted bounding boxes (\ref{sec:method-reward}).
  • Figure 3: Grounding effect on answering.
  • Figure 4: More analysis of VisionCoach. We provide analysis including (a) statistics of visual prompting, (b) inference latency, and (c) the effect of visual prompting on spatial grounding reward.
  • Figure 5: Ablation of VisionCoach on V-STAR. $r_{\text{spa}}$: object-aware spatial grounding reward. $\mathcal{L}_{\mathrm{SD}}$: self-distillation. VP-F: fixed visual prompting (darken). VP-S: VP-Selector.
  • ...and 6 more figures
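
The Figure 5 caption names "darken" as the fixed visual prompt used in the VP-F ablation. As a rough illustration of what a darkening prompt could look like, the sketch below dims every pixel outside a region of interest so that the question-relevant area stands out. The function name `darken_outside_box`, the grayscale nested-list frame format, the half-open box convention, and the `factor` parameter are assumptions for this example; the source does not specify how the prompt is rendered.

```python
from typing import List, Tuple

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities (assumed)


def darken_outside_box(
    frame: Frame, box: Tuple[int, int, int, int], factor: float = 0.3
) -> Frame:
    """Return a copy of `frame` with pixels outside `box` scaled by `factor`.

    `box` is (x1, y1, x2, y2) in pixel coordinates, with x2 and y2 exclusive.
    Pixels inside the box are left unchanged.
    """
    x1, y1, x2, y2 = box
    return [
        [
            pixel if (y1 <= y < y2 and x1 <= x < x2) else pixel * factor
            for x, pixel in enumerate(row)
        ]
        for y, row in enumerate(frame)
    ]


frame = [[100.0, 100.0], [100.0, 100.0]]
print(darken_outside_box(frame, (0, 0, 1, 1), factor=0.5))
```

Per the abstract, a prompt like this would be applied only during RL training on challenging inputs; at inference the model reasons over raw, unprompted frames.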