Bridging Visual Representation and Reinforcement Learning from Verifiable Rewards in Large Vision-Language Models

Yuhang Han, Yuyang Wu, Zhengbo Jiao, Yiyu Wang, Xuyang Liu, Shaobo Wang, Hanlin Xu, Xuming Hu, Linfeng Zhang

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has substantially enhanced the reasoning capabilities of large language models in abstract reasoning tasks. However, its application to Large Vision-Language Models (LVLMs) remains constrained by a structural representational bottleneck. Existing approaches generally lack explicit modeling and effective utilization of visual information, preventing visual representations from being tightly coupled with the reinforcement learning optimization process and thereby limiting further improvements in multimodal reasoning performance. To address this limitation, we propose KAWHI (Key-Region Aligned Weighted Harmonic Incentive), a plug-and-play reward reweighting mechanism that explicitly incorporates structured visual information into uniform reward policy optimization methods (e.g., GRPO and GSPO). The method adaptively localizes semantically salient regions through hierarchical geometric aggregation, identifies vision-critical attention heads via structured attribution, and performs paragraph-level credit reallocation to align spatial visual evidence with semantically decisive reasoning steps. Extensive empirical evaluations on diverse reasoning benchmarks substantiate KAWHI as a general-purpose enhancement module, consistently improving the performance of various uniform reward optimization methods. Project page: KAWHI (https://kawhiiiileo.github.io/KAWHI_PAGE/)
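
The three steps named in the abstract can be made concrete with a small sketch of the last one, paragraph-level credit reallocation. The Python below is a minimal, hypothetical illustration of the general idea, not the authors' implementation: it splits a decoded response into paragraphs (Figure 2 notes that segmentation uses the \n\n delimiter), scores each paragraph's alignment with visual evidence via a stand-in `visual_alignment` function (per the abstract, the paper instead derives this signal from vision-critical attention heads over SGUF-selected regions), and redistributes a single verifiable reward so that visually grounded reasoning steps receive more credit. All names and the `beta` knob are assumptions of this sketch.

```python
import numpy as np

def split_paragraphs(response: str) -> list[str]:
    # Figure 2: the decoded sequence is segmented by the paragraph delimiter "\n\n".
    return [p for p in response.split("\n\n") if p.strip()]

def visual_alignment(paragraph: str, salient_keywords: set[str]) -> float:
    # Hypothetical stand-in for the paper's attention-based alignment score:
    # here we simply measure keyword overlap with the salient image regions.
    tokens = set(paragraph.lower().split())
    return len(tokens & salient_keywords) / max(len(salient_keywords), 1)

def kawhi_reweight(base_reward: float, response: str,
                   salient_keywords: set[str], beta: float = 0.5) -> list[tuple[str, float]]:
    """Redistribute a single verifiable reward across paragraphs.

    Paragraphs that align better with visual evidence receive a larger share.
    beta interpolates between uniform credit (beta=0) and fully
    alignment-weighted credit (beta=1); it is an assumption of this sketch.
    """
    paragraphs = split_paragraphs(response)
    if not paragraphs:
        return []
    scores = np.array([visual_alignment(p, salient_keywords) for p in paragraphs])
    uniform = np.full(len(paragraphs), 1.0 / len(paragraphs))
    aligned = scores / scores.sum() if scores.sum() > 0 else uniform
    weights = (1 - beta) * uniform + beta * aligned
    return [(p, base_reward * w) for p, w in zip(paragraphs, weights)]

if __name__ == "__main__":
    response = ("The diagram shows a right triangle with legs 3 and 4.\n\n"
                "By the Pythagorean theorem the hypotenuse is 5.")
    salient = {"diagram", "right", "triangle", "legs"}
    for para, r in kawhi_reweight(1.0, response, salient):
        print(f"{r:.3f}  {para[:50]}")
```

In this toy run the visually grounded first paragraph receives 0.75 of the reward and the purely symbolic second paragraph 0.25, illustrating how a uniform GRPO-style reward can be tilted toward steps that cite visual evidence.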


Paper Structure

This paper contains 21 sections, 26 equations, 7 figures, 3 tables, and 2 algorithms.

Figures (7)

  • Figure 1: (a) Error distribution across five categories: VP (visual perception error), AO (answer only), CE (calculation error), RAE (rule application error), and Others. (b) Two VP cases from MathVerse. Red highlights indicate visual misinterpretation.
  • Figure 2: Overview of the KAWHI mechanism. Critical regions are selected using the SGUF algorithm (see the SGUF section for details). The decoded sequence denotes the textual output generated from response tokens, segmented by the paragraph delimiter \n\n.
  • Figure 3: Vision-Critical head identification via global ablation on the MME benchmark. Here, b denotes the performance score of the baseline models (Qwen2.5-VL-7B-Instruct and Qwen3-VL-4B-Instruct); a schematic sketch of this ablation procedure appears after the figure list.
  • Figure 4: Validation of VisionZip as an effective alternative to SGUF under a 50% visual-token budget.
  • Figure 5: Computational efficiency analysis of KAWHI: component-wise isolation using GRPO to assess the additional computational overhead introduced by our method.
  • ...and 2 more figures
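
Figure 3's head-identification step can also be read procedurally: ablate each attention head in turn, re-evaluate the benchmark, and rank heads by their drop from the baseline score b. The sketch below is a schematic rendering of that reading, not the paper's implementation; `evaluate_benchmark`, `model.ablate_head`, and `model.restore_head` are hypothetical interfaces introduced only for illustration.

```python
def identify_vision_critical_heads(model, evaluate_benchmark,
                                   num_layers: int, heads_per_layer: int,
                                   top_k: int = 10):
    """Rank attention heads by how much benchmark performance drops when ablated.

    A large drop relative to the baseline score b suggests the head carries
    vision-critical information. `evaluate_benchmark(model)` and the
    ablate/restore methods are hypothetical stand-ins, not a real API.
    """
    b = evaluate_benchmark(model)  # baseline score, as in Figure 3
    drops = {}
    for layer in range(num_layers):
        for head in range(heads_per_layer):
            model.ablate_head(layer, head)   # e.g. zero out the head's output
            drops[(layer, head)] = b - evaluate_benchmark(model)
            model.restore_head(layer, head)
    # Heads whose removal hurts the score most are deemed vision-critical.
    return sorted(drops, key=drops.get, reverse=True)[:top_k]
```

Exhaustive per-head re-evaluation is quadratic in practice (heads × benchmark passes), so any real pipeline would likely batch ablations or evaluate on a benchmark subset; the loop above is only meant to convey the ranking criterion.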