High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

Xinyu Huang, Yuhao Dong, Weiwei Tian, Bo Li, Rui Feng, Ziwei Liu

TL;DR

This work tackles high-resolution visual reasoning in large multimodal models by introducing MGPO, a multi-turn grounding-based reinforcement learning framework that iteratively crops sub-images based on model-predicted grounding coordinates. MGPO operates with a fixed two-turn template to overcome cold-start issues and relies on a binary final-answer reward, enabling emergent grounding capabilities without explicit grounding annotations. Empirical results on MME-Realworld and V* Bench show MGPO achieving notable gains over SFT and GRPO, including surpassing OpenAI o1 and GPT-4o with only 21k training samples. The approach provides interpretable visual grounding and effectively mitigates maximum-pixel constraints in high-resolution tasks, suggesting practical impact for real-world visual reasoning systems.

Abstract

State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into an enormous number of visual tokens, many of which are irrelevant to the downstream task. In this paper, we propose Multi-turn Grounding-based Policy Optimization (MGPO), an end-to-end reinforcement learning (RL) framework that enables LMMs to iteratively focus on key visual regions by automatically cropping sub-images based on model-predicted grounding coordinates within a multi-turn conversation framework. Compared to supervised fine-tuning (SFT), which requires costly additional grounding annotations, our approach shows that robust grounding abilities can emerge in LMMs during RL training, using only a binary reward function derived from the correctness of the final answer. Additionally, we observe that LMMs struggle to autonomously trigger visual grounding during the rollout process. To address this cold-start problem, we design a multi-turn conversational template and restrict policy loss computation to model outputs generated across multiple dialogue rounds, thereby promoting stable optimization. Extensive experiments demonstrate that, when trained on standard visual-question-short-answering data without grounding annotations, MGPO elicits stronger grounding capabilities than GRPO, leading to a 5.4% improvement on in-distribution MME-Realworld and a 5.2% improvement on the challenging out-of-distribution (OOD) V* Bench. Notably, MGPO post-training on Qwen2.5-VL-7B with 21K samples surpasses OpenAI's o1 and GPT-4o models on the OOD V* Bench. Code is available at https://github.com/EvolvingLMMs-Lab/MGPO.
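The three mechanisms named in the abstract (cropping from predicted coordinates, a binary final-answer reward, and a loss restricted to model outputs) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `(x1, y1, x2, y2)` box format, the assumption that coordinates are predicted on a downscaled copy of the image and mapped back to the original, the case-insensitive exact-match rule, and the role labels are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), a hypothetical box format


def map_box_to_original(box: Box,
                        resized_size: Tuple[int, int],
                        original_size: Tuple[int, int]) -> Box:
    """Rescale a box predicted on a resized image to original-image pixels.

    Assumption: the model sees a downscaled image and the crop is taken
    from the full-resolution original. Sizes are (width, height).
    """
    rw, rh = resized_size
    ow, oh = original_size
    sx, sy = ow / rw, oh / rh
    x1, y1, x2, y2 = box
    # Tolerate corners given in either order.
    x1, x2 = sorted((x1, x2))
    y1, y2 = sorted((y1, y2))
    # Scale, then clamp to the image bounds.
    left = max(0, min(ow, round(x1 * sx)))
    top = max(0, min(oh, round(y1 * sy)))
    right = max(0, min(ow, round(x2 * sx)))
    bottom = max(0, min(oh, round(y2 * sy)))
    return (left, top, right, bottom)


def binary_reward(prediction: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.

    Assumption: case-insensitive exact match after trimming whitespace.
    """
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def loss_mask(roles: List[str]) -> List[int]:
    """Mark which conversation segments contribute to the policy loss.

    Assumption: segments are labeled "user" or "assistant"; only
    model-generated ("assistant") segments are kept.
    """
    return [1 if role == "assistant" else 0 for role in roles]


if __name__ == "__main__":
    # A box predicted on a 1000x500 view of a 4000x2000 image.
    print(map_box_to_original((100, 50, 200, 150), (1000, 500), (4000, 2000)))
    print(binary_reward(" Red ", "red"))
    print(loss_mask(["user", "assistant", "user", "assistant"]))
```

In this sketch, the mapped box could be passed to any image library's crop routine to produce the sub-image returned to the model in the second turn.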

Paper Structure

This paper contains 19 sections, 4 equations, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: Examples of models trained with multi-turn grounding-based RL on high-resolution real-world tasks. The model first identifies key regions, which are then automatically cropped and returned as sub-images. Notably, despite using only a binary reward function derived from the correctness of the final answer, the model gradually develops robust grounding capability throughout the RL process. The conversation in the figure shows only key parts; the full conversation is provided in the Appendix.
  • Figure 2: Comparison of different post-training paradigms for LMMs. Our MGPO automatically crops and returns a sub-image to the model based on its predicted grounding coordinates, enabling the model to iteratively focus on key regions and effectively solve high-resolution visual tasks.
  • Figure 3: Fixed multi-turn grounding template, which eliminates the cold-start SFT process.
  • Figure 4: Distribution of image resolutions (width $\times$ height) across different datasets.
  • Figure 4: Performance comparison on the image count task. An additional point reward does not lead to significant performance improvements.
  • ...and 5 more figures