VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use

Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, Klara Nahrstedt

TL;DR

The paper tackles the challenge of enabling vision-language models to perform multimodal reasoning by training them to generate multimodal chains of thought that interleave textual reasoning with intermediate visual steps edited via Python tools. It introduces VTool-R1, which combines reinforcement learning finetuning with outcome-based rewards and GRPO to train VLMs to decide when and how to apply visual tool edits for improved reasoning. The approach is validated on chart- and table-based VQA tasks, showing improved reasoning accuracy and the emergence of adaptive, coherent multimodal reasoning. This work demonstrates a scalable path toward richer multimodal cognition by leveraging external tools to extend what models can reason about beyond their static training data.

Abstract

Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our approach elicits strategic visual tool use for reasoning without relying on process-based supervision. Experiments on structured visual question answering over charts and tables show that VTool-R1 enhances reasoning performance by teaching VLMs to "think with images" and generate multimodal chains of thought with tools.
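
As a rough illustration of how an outcome-based reward can drive GRPO-style training, the sketch below scores a group of sampled responses to one query and normalizes the rewards within the group. It is a minimal sketch under stated assumptions, not the authors' implementation: the exact-match reward, the function names outcome_reward and group_relative_advantages, and the use of the population standard deviation are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def outcome_reward(predicted: str, gold: str) -> float:
    """Hypothetical binary outcome reward: 1.0 on a case-insensitive
    exact match of the final answer, 0.0 otherwise."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one sampled group (GRPO-style):
    subtract the group mean and divide by the group standard deviation.
    Population std and the eps term are assumptions of this sketch."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled final answers for a single query whose gold answer is "42".
answers = ["42", "41", "42 ", "7"]
rewards = [outcome_reward(a, "42") for a in answers]
advantages = group_relative_advantages(rewards)
```

Because only the final answer is scored, a response that calls a visual tool and one that does not are rewarded on the same footing; within a group, whichever behavior leads to correct answers more often receives the positive advantage. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.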

Paper Structure

This paper contains 18 sections, 3 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Multi-Modal GRPO w. Tool Use Training Pipeline, where the input $q$ is a multimodal query
  • Figure 2: Qualitative Example from VTool-R1 (3B): The Model Successfully Integrates Intermediate Visual Steps.
  • Figure 3: Multi-Modal GRPO w. Tool Use Training Dynamics, for 3B models
  • Figure 4: Multi-Modal GRPO w. Tool Use Training Dynamics, for 32B models