VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection

Zeyi Huang, Yuyang Ji, Anirudh Sundara Rajan, Zefan Cai, Wen Xiao, Haohan Wang, Junjie Hu, Yong Jae Lee

TL;DR

VisTA is a reinforcement learning framework that trains an agent to select and combine external visual tools for multimodal reasoning while keeping the underlying reasoner frozen. The agent is trained with Group Relative Policy Optimization (GRPO) to learn query-specific tool selections from task outcomes, so it can explore diverse tool pathways without explicit reasoning supervision. Results on ChartQA, Geometry3K, and BlindTest show substantial gains over training-free baselines, particularly on out-of-distribution (OoD) examples, and the learned policy transfers to stronger reasoning models. The work shows that learned tool selection can significantly augment visual reasoning systems and lays a foundation for modular, experience-driven multimodal reasoning.

Abstract

We introduce VisTA, a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. Existing methods for tool-augmented reasoning either rely on training-free prompting or large-scale fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool selection strategies, using task outcomes as feedback signals. Through Group Relative Policy Optimization (GRPO), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks demonstrate that VisTA achieves substantial performance gains over training-free baselines, especially on out-of-distribution examples. These results highlight VisTA's ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.
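
As a rough illustration of the "group relative" part of GRPO mentioned above, the sketch below normalizes the outcome rewards of several tool selections sampled for the same query against the group's own mean and standard deviation. This is the commonly used GRPO advantage form, not code from the paper; the function name `group_relative_advantages` and the `eps` parameter are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar task-outcome reward per sampled tool
    selection for a single query. `eps` (an assumed small constant)
    avoids division by zero when every sample gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled tool selections for one query: two helped, two did not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Selections that score above the group average receive a positive advantage and are reinforced; those below receive a negative one. When all samples in a group get the same reward, every advantage is zero and that group contributes no learning signal.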

Paper Structure

This paper contains 20 sections, 2 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Overview of VisTA. (Left) Our method trains an agent to autonomously discover effective combinations of visual tools without human supervision. (Right) By decoupling the agent from the reasoner, the learned policy can be seamlessly integrated with a wide range of reasoning models.
  • Figure 2: Policy Optimization. Given a user query, the agent selects tools from a pre-defined set of external tools. The tools are applied to the image, and their outputs and the query are fed to a frozen reasoner model. Both the Direct Path (query+image) and the Tool-Augmented Path (query+tools+image) are evaluated to compute a reward signal, which is used to update the agent's tool-selection policy.
  • Figure 3: Comparison of ChartQA accuracy across individual tools (T0–T8), the no-tool baseline (No), our RL-based selection policy (Ours), and a pseudo-upper bound (Upper).
  • Figure 4: Pearson correlation between tool usage frequency and individual tool performance.
  • Figure 5: Tool selection frequency across our RL-trained agent, QwenVL-7B, and GPT-4o. Our method strongly favors effective tools (Tools 1 and 2) and avoids less useful ones, while QwenVL-7B shows a uniform distribution and GPT-4o selects broadly without clear alignment to tool performance.
  • ...and 2 more figures
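
The two paths described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's reward definition: the reward here is simply the tool-augmented path's correctness minus the direct path's correctness, and `two_path_reward`, the `reasoner(query, image, tool_outputs)` signature, and the stub tool and reasoner are all hypothetical.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces: a tool maps an image to a text output, and the
# frozen reasoner maps (query, image, tool outputs) to an answer string.
Tool = Callable[[str], str]
Reasoner = Callable[[str, str, List[str]], str]


def two_path_reward(query: str, image: str, answer: str,
                    selected_tools: Sequence[Tool],
                    reasoner: Reasoner) -> float:
    """Assumed reward: +1 if tools fix a wrong answer, -1 if they break
    a correct one, 0 if they change nothing."""
    # Direct Path: query + image only.
    direct_ok = reasoner(query, image, []) == answer
    # Tool-Augmented Path: query + tool outputs + image.
    tool_outputs = [tool(image) for tool in selected_tools]
    tool_ok = reasoner(query, image, tool_outputs) == answer
    return float(tool_ok) - float(direct_ok)


# Stub tool and reasoner, for illustration only.
def chart_to_table(image: str) -> str:
    return "sales_2020=42"


def stub_reasoner(query: str, image: str, tool_outputs: List[str]) -> str:
    # Answers correctly only when some tool output exposes the value.
    return "42" if any("42" in out for out in tool_outputs) else "unknown"


reward = two_path_reward("What were sales in 2020?", "chart.png", "42",
                         [chart_to_table], stub_reasoner)
```

Because the reasoner is only called, never updated, the reward depends solely on which tools the agent selected, which is what allows the policy to be trained with the reasoner kept frozen.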