VisRL: Intention-Driven Visual Perception via Reinforced Reasoning

Zhangquan Chen, Xufang Luo, Dongsheng Li

TL;DR

The paper tackles the annotation bottleneck in intention-driven visual perception by removing the need for dense bounding-box supervision. It introduces VisRL, a reinforcement learning framework that optimizes the entire visual reasoning process using step-level Direct Preference Optimization (DPO) on self-generated data. The method uses a two-stage RL pipeline, first refining the focus region and then reasoning over the cropped region, driven solely by task rewards. Experiments across multiple benchmarks and backbones show that VisRL achieves consistent improvements and strong generalization without external tools.

Abstract

Visual understanding is inherently intention-driven: humans selectively focus on different regions of a scene based on their goals. Recent advances in large multimodal models (LMMs) enable flexible expression of such intentions through natural language, allowing queries to guide visual reasoning processes. Frameworks like Visual Chain-of-Thought have demonstrated the benefit of incorporating explicit reasoning steps, where the model predicts a focus region before answering a query. However, existing approaches rely heavily on supervised training with annotated intermediate bounding boxes, which severely limits scalability due to the combinatorial explosion of intention-region pairs. To overcome this limitation, we propose VisRL, the first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception. VisRL optimizes the entire visual reasoning process using only reward signals. By treating intermediate focus selection as an internal decision optimized through trial-and-error, our method eliminates the need for costly region annotations while aligning more closely with how humans learn to perceive the world. Extensive experiments across multiple benchmarks show that VisRL consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs. Our code is available at https://github.com/zhangquanchen/VisRL.
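
The focus-then-answer procedure mentioned in the abstract (predict a focus region, then answer from the cropped region) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the image is a toy nested list of pixel values, and `predict_box` and `answer` are hypothetical callables standing in for the two model calls.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open pixel coordinates


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image inside `box`, clamped to the image bounds."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [row[x0:x1] for row in image[y0:y1]]


def focus_then_answer(
    image: Image,
    query: str,
    predict_box: Callable[[Image, str], Box],  # hypothetical: step 1, pick a focus region
    answer: Callable[[Image, str], object],    # hypothetical: step 2, answer from the crop
) -> Tuple[Box, object]:
    """Two-step inference: select a focus region, then answer over the crop."""
    box = predict_box(image, query)
    region = crop(image, box)
    return box, answer(region, query)
```

In this sketch the bounding box is only an intermediate decision: nothing in the function requires a ground-truth box, which is the property that lets a reward on the final answer drive the region choice.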

Paper Structure

This paper contains 25 sections, 10 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Illustration of using RL to optimize the visual reasoning process. SFT trains with densely annotated training data for several epochs. VisRL leverages self-generated data and self-provided rewards to iteratively update the model using step-level DPO. This RL process removes the need for bounding box annotations, enabling a more human-like, intention-driven visual perception.
  • Figure 2: Schematic illustration of our VisRL framework. VisRL first uses a small amount of data for SFT warm-up; in the subsequent RL training phase, it can leverage large-scale data without bounding box annotations. The RL phase of VisRL consists of iterative cycles of data generation and optimization, and $k$ in the figure indicates the iteration index. The data generation process does not rely on external models or annotations; instead, it employs the model itself for data synthesis and scoring. The optimization step adopts step-level DPO to ensure the model learns each step of the reasoning process. In summary, VisRL enables intention-driven visual perception by leveraging RL to learn from task rewards without requiring annotations or external help.
  • Figure 3: Schematic illustration of our data generation pipeline. Here $\mathcal{M}_{RL}^k$ denotes the model updated with $k$ iterations of data generation and optimization, and $\mathcal{M}_{org}$ is the original model. VisRL uses $\mathcal{M}_{RL}^k$ to generate samples and uses $\mathcal{M}_{org}$ to provide rewards. Hence, different versions of a single model are used in this self-evolving data generation process, and neither bounding box annotations nor external models are introduced.
  • Figure 4: Performance of our VisRL over multiple iterations; the gains are attributed to the intertwined improvement of data quality and model capability during the iterative process. The accuracy is calculated as the average value over the 11 datasets listed in Tab. \ref{table:mllm}.
  • Figure 5: Visualization of LLaVA-1.5 vs. VisCoT vs. VisRL (based on LLaVA-1.5). GT bounding boxes are shown in blue, VisCoT-generated bounding boxes in red, and bounding boxes generated by our method in green. The scores are evaluated by GPT-4o. Our method consistently delivers the best results across various benchmarks. More visualizations are in the Supp. Mat.
  • ...and 6 more figures
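
The captions for Figures 1 and 2 refer to step-level DPO as the optimization step. The paper's exact objective is not reproduced in this section, so the following is only a hedged sketch: it computes the standard DPO loss for one preference pair and, as an assumption, averages it over the per-step pairs (for example, one pair for the focus-region step and one for the answer step). The function names and the default `beta` value are illustrative, and the log-probabilities are assumed to be precomputed sums over the tokens of each response.

```python
import math
from typing import Sequence, Tuple


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default, not taken from the paper
) -> float:
    """Standard DPO loss for a single (chosen, rejected) pair."""
    # Implicit rewards are log-ratios of the policy to the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# One entry per reasoning step:
# (policy_chosen, policy_rejected, ref_chosen, ref_rejected) log-probabilities.
StepPair = Tuple[float, float, float, float]


def step_level_dpo_loss(steps: Sequence[StepPair], beta: float = 0.1) -> float:
    """Assumed aggregation: mean of the DPO loss over the per-step pairs."""
    return sum(dpo_loss(*step, beta=beta) for step in steps) / len(steps)
```

When the policy and the reference assign identical log-probabilities, the loss equals log 2; it falls below that value as the policy raises the chosen response relative to the rejected one.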