
Visual Agentic Reinforcement Fine-Tuning

Ziyu Liu, Yuhang Zang, Yushan Zou, Zijian Liang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang

TL;DR

This work introduces Visual-ARFT, a reward-driven reinforcement fine-tuning framework that equips large vision-language models with agentic, tool-using capabilities for multimodal reasoning. Built atop verifiable rewards and GRPO, it enables tasks like agentic web search and code-driven image manipulation, evaluated on the MAT benchmark (MAT-Search and MAT-Coding) and extended to existing multi-hop QA datasets. Key contributions include a modular reward design with format and accuracy signals, a dedicated multimodal benchmark suite, and strong empirical gains that, in some settings, surpass proprietary baselines such as GPT-4o, highlighting the potential of open-source, agentic multimodal systems. The results demonstrate improved data efficiency, generalization to out-of-domain tasks, and a promising direction for scalable, tool-enabled multimodal reasoning. This work thus provides a concrete, benchmarked path toward robust multimodal agents capable of planning, reasoning, and interacting with external tools in real time.
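To make the "format and accuracy signals" and the GRPO step more concrete, here is a minimal sketch of how a verifiable reward could be combined and then turned into group-relative advantages. It is an illustration under stated assumptions, not the authors' implementation: the `<think>`/`<answer>` tag layout, the exact-match accuracy check, the equal weighting of the two signals, and all function names are hypothetical. The only part taken from the standard GRPO formulation is normalizing each reward by the mean and standard deviation of its sampling group.

```python
import re
from statistics import mean, pstdev

# Hypothetical output layout: reasoning in <think>, final answer in <answer>.
FORMAT_RE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL
)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.fullmatch(response) else 0.0


def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the extracted answer matches the reference (case-insensitive).

    Exact match is used here only as a simple stand-in for a verifiable check.
    """
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == gold.strip().lower() else 0.0


def total_reward(response: str, gold: str) -> float:
    """Sum of the two signals (equal weighting is an assumption)."""
    return format_reward(response) + accuracy_reward(response, gold)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    gold = "Paris"
    group = [
        "<think>The capital of France.</think><answer>Paris</answer>",
        "<think>Maybe Lyon?</think><answer>Lyon</answer>",
        "Paris",  # correct content, but the assumed format is missing
    ]
    rewards = [total_reward(r, gold) for r in group]
    print(rewards)
    print(group_advantages(rewards))
```

Because the advantage is computed relative to the other responses sampled for the same prompt, a group in which every response earns the same reward contributes no learning signal in this sketch.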

Abstract

A key trend in Large Reasoning Models (e.g., OpenAI's o3) is the native agentic ability to use external tools, such as browsing the web for search and writing and executing code for image manipulation, to think with images. In the open-source research community, while significant progress has been made in language-only agentic abilities such as function calling and tool integration, multi-modal agentic capabilities that involve truly thinking with images, and their corresponding benchmarks, remain less explored. This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities for Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques. We also present a Multi-modal Agentic Tool Bench (MAT) with two settings (MAT-Search and MAT-Coding) designed to evaluate LVLMs' agentic search and coding abilities. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search, ultimately surpassing GPT-4o. Visual-ARFT also achieves +29.3% F1 / +25.9% EM gains on existing multi-hop QA benchmarks such as 2Wiki and HotpotQA, demonstrating strong generalization capabilities. Our findings suggest that Visual-ARFT offers a promising path toward building robust and generalizable multimodal agents.
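The gains above are reported as F1 and EM (exact match). As a reading aid, the sketch below shows one common way these two QA metrics are computed, using SQuAD-style answer normalization and token overlap. Whether the paper applies exactly this normalization is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower", "eiffel tower"))
    print(token_f1("Paris France", "Paris"))
```

EM only rewards answers that match the reference after normalization, while F1 gives partial credit for overlapping tokens, which is why the two numbers are usually reported together.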

Paper Structure

This paper contains 29 sections, 5 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: The benefits of our Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for complex multi-modal reasoning tasks, such as (top) writing and executing Python code to accurately read text within a specified image region and (bottom) using internet search to answer a multi-hop question.
  • Figure 2: Overview of Visual-ARFT. We successfully empower LVLMs with multimodal agentic capabilities, including (a) agentic search and (b) agentic coding, enabling them to solve complex multimodal tasks through reasoning, decomposition, and tool interaction.
  • Figure 3: Data Annotation Pipeline of our proposed Multimodal Agentic Tool Bench (MAT): (a) MAT-Search, a manually annotated and verified dataset for agentic search, and (b) MAT-Coding, an automatically generated dataset for agentic coding with a structured pipeline.
  • Figure 4: Visualized inference cases of Visual-ARFT, demonstrating its multi-modal agentic capabilities: processing an image and answering a question via code generation and execution (left), and solving multi-hop VQA through query decomposition and search tool invocation (right).
  • Figure 5: Prompt for Agentic Searching Tasks
  • ...and 4 more figures