Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images

Wenhao Yang, Yu Xia, Jinlong Huang, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Yuanyu Wan, Lijun Zhang

TL;DR

DRIM tackles unreliable multi-turn reasoning in vision-language models with a three-stage pipeline: constructing a high-difficulty, verifiable visual QA dataset, cold-start supervised fine-tuning (SFT), and reinforcement learning (RL) with a redundancy-penalized objective that encourages self-reflection and broad multi-scale exploration. The method enables iterative zoom-in tool calls on high-resolution images, producing deep but reliable multimodal chain-of-thought. Experiments on high-resolution benchmarks show that DRIM achieves state-of-the-art or competitive performance, and ablations validate the importance of each component. The work advances the thinking-with-images paradigm and provides datasets, prompts, and training schemes for future research.

Abstract

Recent advances in large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), which is achieved by actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct themselves when attempting incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT and RL. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern. The basic idea is to impose judgment on reasoning trajectories and penalize those that produce incorrect answers without sufficient multi-scale exploration. Extensive experiments demonstrate that DRIM achieves superior performance on visual understanding benchmarks.
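
The abstract describes the RL objective only at a high level: judge each reasoning trajectory and penalize those that reach an incorrect answer without sufficient multi-scale exploration. The sketch below is one possible reading of that idea, not the paper's actual reward. The `Trajectory` type, the area-based scale buckets, the `min_scales` threshold, the `penalty` value, and the exact-match answer check are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical representation: each zoom-in call is summarized by its
# normalized crop box (x0, y0, x1, y1) with coordinates in [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class Trajectory:
    zoom_boxes: List[Box]  # crop boxes requested during reasoning
    answer: str            # final answer produced by the model


def count_scales(boxes: List[Box], bins=(0.05, 0.25, 1.0)) -> int:
    """Count distinct scale buckets, bucketing crops by area fraction.

    The bucket edges are illustrative assumptions, not values from the paper.
    """
    seen = set()
    for x0, y0, x1, y1 in boxes:
        area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
        for index, upper in enumerate(bins):
            if area <= upper:
                seen.add(index)
                break
    return len(seen)


def trajectory_reward(traj: Trajectory, gold: str,
                      min_scales: int = 2, penalty: float = 0.5) -> float:
    """Assumed reward shape: +1 for a correct answer, a negative reward for
    a wrong answer reached with too little multi-scale exploration, and 0
    for a wrong answer reached after exploring at least `min_scales` scales.
    """
    if traj.answer.strip().lower() == gold.strip().lower():
        return 1.0
    if count_scales(traj.zoom_boxes) < min_scales:
        return -penalty
    return 0.0
```

Under these assumptions, a wrong answer given after a single crop scores lower than a wrong answer given after crops at two different scales, so the policy is pushed to explore further before committing to an answer.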

Paper Structure

This paper contains 31 sections, 3 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Illustration of DRIM performing multi-turn reasoning to tackle a visual search task. Our model thinks with images in its multimodal CoT (MCoT), invoking the zoom-in tool to crop the image and analyze it more thoroughly. In addition, DRIM can reflect and self-correct during the reasoning process (highlighted in blue), thereby localizing the correct region and producing the final answer (highlighted in red).
  • Figure 2: Overview of Agentic Pipeline
  • Figure 3: Reward Signal in RL training
  • Figure 5: Overview of the pipeline for implementing DRIM. The pipeline comprises three stages: data construction, cold-start SFT, and RL. First, we construct a new multimodal dataset and synthesize multi-turn tool-call trajectories to serve as cold-start data. Second, the synthesized trajectories are used to fine-tune the model (SFT), enabling it to acquire tool-use abilities and multi-turn reasoning. Third, we design reward signals that encourage the model to autonomously explore and optimize its reasoning trajectories during RL training.
  • Figure 6: Our automated scheme for data construction. Frontier VLMs select and zoom into regions of interest, and then generate QA pairs about each selected region.
  • ...and 8 more figures
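
The zoom-in tool mentioned in the Figure 1 caption crops a region of the image for closer analysis. As a minimal sketch of the coordinate handling such a tool might need, the function below maps a normalized box to a clamped pixel box. The function name `zoom_box_to_pixels` and the normalized `(x0, y0, x1, y1)` input convention are assumptions for illustration; the paper's actual tool interface is not specified in this section.

```python
import math
from typing import Tuple


def zoom_box_to_pixels(box: Tuple[float, float, float, float],
                       width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert a normalized (x0, y0, x1, y1) box into an integer pixel box
    (left, top, right, bottom) that lies inside the image and is at least
    one pixel wide and one pixel tall.
    """
    # Tolerate swapped corners by sorting each axis.
    x0, x1 = sorted((box[0], box[2]))
    y0, y1 = sorted((box[1], box[3]))

    def clamp01(value: float) -> float:
        # Keep out-of-range coordinates inside the image.
        return min(1.0, max(0.0, value))

    # Round outward so the crop never shrinks below the requested region.
    left = min(width - 1, math.floor(clamp01(x0) * width))
    top = min(height - 1, math.floor(clamp01(y0) * height))
    right = min(width, max(left + 1, math.ceil(clamp01(x1) * width)))
    bottom = min(height, max(top + 1, math.ceil(clamp01(y1) * height)))
    return left, top, right, bottom
```

The returned tuple uses the `(left, upper, right, lower)` order that Pillow's `Image.crop` accepts, so it could be passed to that method directly if Pillow is the imaging library in use.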