Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning

Jiacheng Yang, Anqi Chen, Yunkai Dang, Qi Fan, Cong Wang, Wenbin Li, Feng Miao, Yang Gao

TL;DR

HART (High-resolution Annotation-free Reasoning Technique) is a closed-loop framework that enables LMMs to focus on and self-verify key regions of high-resolution visual inputs. Its post-training paradigm uses Advantage Preference Group Relative Policy Optimization (AP-GRPO) to encourage accurate localization of key regions.

Abstract

Current Large Multimodal Models (LMMs) struggle with high-resolution visual inputs during the reasoning process, as the number of image tokens increases quadratically with resolution, introducing substantial redundancy and irrelevant information. A common practice is to identify key image regions and refer to their high-resolution counterparts during reasoning, typically trained with external visual supervision. However, such visual supervision cues require costly grounding labels from human annotators. Meanwhile, it remains an open question how to enhance a model's grounding abilities to support reasoning without relying on additional annotations. In this paper, we propose High-resolution Annotation-free Reasoning Technique (HART), a closed-loop framework that enables LMMs to focus on and self-verify key regions of high-resolution visual inputs. HART incorporates a post-training paradigm in which we design Advantage Preference Group Relative Policy Optimization (AP-GRPO) to encourage accurate localization of key regions. Notably, HART provides explainable reasoning pathways and enables efficient optimization of localization. Extensive experiments demonstrate that HART improves performance across a wide range of high-resolution visual tasks, consistently outperforming strong baselines. When applied to post-train Qwen2.5-VL-7B, HART even surpasses larger-scale models such as Qwen2.5-VL-72B and LLaVA-OneVision-72B on high-resolution, vision-centric benchmarks.
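
The quadratic growth of image tokens noted in the abstract can be illustrated with a short calculation. This is a minimal sketch, not taken from the paper: the function name `visual_token_count` and the assumption that each visual token covers a 28 x 28 pixel square are illustrative choices only.

```python
import math


def visual_token_count(height: int, width: int, pixels_per_token_side: int = 28) -> int:
    """Approximate number of visual tokens for an image of the given size.

    Assumes each token covers a square of `pixels_per_token_side` pixels.
    The default of 28 is an illustrative assumption, not a value from the paper.
    """
    rows = math.ceil(height / pixels_per_token_side)
    cols = math.ceil(width / pixels_per_token_side)
    return rows * cols


# Doubling the side length of a square image quadruples the token count.
for side in (448, 896, 1792, 3584):
    print(side, visual_token_count(side, side))
# 448 256
# 896 1024
# 1792 4096
# 3584 16384
```

Under these assumptions, an 8x increase in side length yields 64x as many tokens, which is why restricting reasoning to a few key regions can remove most of the redundant input.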

Paper Structure (18 sections, 2 theorems, 6 equations, 4 figures, 6 tables)

Key Result

Proposition 1

Let $I_{\text{HART}}(L;R)$ and $I_{\text{baseline}}(L;R)$ denote the mutual information between localization correctness $L$ and response correctness $R$ under our method and baselines, respectively. Then,

Figures (4)

  • Figure 1: Optimization procedures of (a) general grounding-based methods without bounding-box annotations and (b) our proposed model. General models indirectly optimize grounding performance, while HART performs direct optimization by answering based solely on the ROIs. Abbreviations: Q—Question; A—Answer.
  • Figure 2: Left: An example from Qwen2.5-VL-7B in which the final answer is correct but the grounding is incorrect. Right: The HART framework. The post-training strategy consists of Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT). In stage 1, after identifying the ROIs, the model answers based solely on the sub-regions and the original question. AP-GRPO is introduced to improve the model’s grounding capabilities. In stage 2, HART uses SFT to further enhance the high-resolution reasoning capabilities.
  • Figure 3: Grounding performance of HART on TreeBench-Perception and TreeBench-Reasoning.
  • Figure 4: Visualization of model outputs from InternVL3-8B [zhu2025internvl3], Qwen2.5-VL-7B [bai2025qwen2], and our method HART-7B on TreeBench [wang2025traceable].
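
The captions for Figures 1 and 2 describe rewarding answers produced from the ROIs alone and optimizing with AP-GRPO. The advantage-preference modification itself is not specified in this section, so the sketch below shows only the standard GRPO group-normalized advantage that AP-GRPO builds on. The function name, the binary reward scheme, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO advantage: normalize each reward within its rollout group.

    This is plain GRPO, not the paper's AP-GRPO variant.
    `eps` guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts for one question. Reward is 1.0 when the
# answer produced from the cropped ROI alone is correct, otherwise 0.0.
roi_only_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(roi_only_rewards)
print(advantages)
```

Rollouts whose ROI was sufficient to answer correctly receive a positive advantage and the others a negative one, so the localization step is pushed directly toward regions that support the answer.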

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 2
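
Proposition 1 is stated in terms of the mutual information between localization correctness $L$ and response correctness $R$. As a rough illustration of that quantity (not of the proposition's conclusion, which is not reproduced here), the sketch below computes a plug-in estimate of $I(L;R)$ from paired binary outcomes. The function name and the sample data are hypothetical.

```python
import math
from collections import Counter


def mutual_information_bits(pairs):
    """Plug-in estimate of I(L;R) in bits from (l, r) samples.

    Each pair holds localization correctness l and response correctness r,
    encoded as 0 or 1.
    """
    n = len(pairs)
    joint = Counter(pairs)
    l_counts = Counter(l for l, _ in pairs)
    r_counts = Counter(r for _, r in pairs)
    mi = 0.0
    for (l, r), count in joint.items():
        p_lr = count / n
        # p(l, r) / (p(l) * p(r)) simplifies to count * n / (count_l * count_r).
        mi += p_lr * math.log2(count * n / (l_counts[l] * r_counts[r]))
    return mi


# Answers are correct exactly when localization is correct: 1 bit.
coupled = [(1, 1), (1, 1), (0, 0), (0, 0)]
# Answer correctness is unrelated to localization correctness: 0 bits.
decoupled = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(mutual_information_bits(coupled), mutual_information_bits(decoupled))
```

The decoupled case corresponds to the failure mode shown on the left of Figure 2, where a correct final answer says little about whether the grounding was correct.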