ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use

Mengjie Deng, Guanting Dong, Zhicheng Dou

TL;DR

ToolScope presents a training-free, modular framework that unifies global task planning with local multimodal perception to tackle long-horizon VQA. By integrating a Global Navigator, an Agentic Executor with Perceive, Search, and Code tools, and a Response Synthesizer, it preserves visual context and enables structured tool use, achieving consistent gains across four benchmarks with up to +6.69% average improvement. The approach demonstrates strong generalization across backbones and scales with model size, while ablations and scaling analyses illuminate the contributions of each component and retrieval settings. Overall, ToolScope offers a practical blueprint for robust, tool-augmented reasoning in multimodal agents and highlights the importance of global-local coordination and perception memory in complex reasoning tasks.

Abstract

Recently, large language models (LLMs) have demonstrated remarkable problem-solving capabilities by autonomously integrating with external tools for collaborative reasoning. However, due to the inherently complex and diverse nature of multimodal information, enabling multimodal large language models (MLLMs) to flexibly and efficiently utilize external tools during reasoning remains an underexplored challenge. In this work, we introduce ToolScope, an agentic framework designed to unify global planning with local multimodal perception, adopting a specialized Perceive tool to mitigate visual context degradation in long-horizon VQA tasks. ToolScope comprises three primary components: the Global Navigator, the Agentic Executor, and the Response Synthesizer. The Global Navigator functions as a "telescope", offering high-level strategic guidance. The Agentic Executor operates iteratively to augment the MLLM with local perception through the integration of three external tools: Search, Code, and Perceive. Finally, the Response Synthesizer consolidates and organizes the reasoning process into a coherent, user-friendly output. We evaluate ToolScope on four VQA benchmarks across diverse domains: VQA 2.0, ScienceQA, MAT-Search, and MathVista. It demonstrates strong generalization capabilities, achieving an average performance improvement of up to +6.69% across all datasets.
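
The three-stage flow described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: every name here (`Trajectory`, `global_navigator`, `agentic_executor`, `response_synthesizer`, `run_toolscope_sketch`, the `policy` callable, and the `max_turns` default) is hypothetical, the MLLM is replaced by a scripted policy, and the tools are stubs that return fixed strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical tool signature: a text argument in, a text observation out.
Tool = Callable[[str], str]
# Hypothetical action format: (kind, tool_name, argument), kind is "call" or "answer".
Action = Tuple[str, str, str]


@dataclass
class Trajectory:
    """Running record of the question, the global guidance, and tool steps."""
    question: str
    guidance: str = ""
    steps: List[str] = field(default_factory=list)


Policy = Callable[[Trajectory], Action]


def global_navigator(question: str, tool_pool: Dict[str, Tool]) -> Tuple[Dict[str, Tool], str]:
    """Pick a toolkit subset and write guidance.

    Assumption: a toy keyword heuristic stands in for the MLLM-based planner.
    """
    words = set(question.lower().rstrip("?").split())
    wanted = ["Perceive"]
    if any(ch.isdigit() for ch in question):
        wanted.append("Code")
    if words & {"who", "when", "where"}:
        wanted.append("Search")
    toolkit = {name: tool_pool[name] for name in wanted if name in tool_pool}
    guidance = "Plan: use " + ", ".join(toolkit) + " as needed, then answer."
    return toolkit, guidance


def agentic_executor(traj: Trajectory, toolkit: Dict[str, Tool],
                     policy: Policy, max_turns: int = 4) -> str:
    """Alternate between policy decisions and tool calls until an answer or the turn limit."""
    for _ in range(max_turns):
        kind, name, arg = policy(traj)
        if kind == "answer":
            return arg
        if name not in toolkit:
            traj.steps.append(f"{name}({arg}) -> error: tool not available")
            continue
        observation = toolkit[name](arg)
        traj.steps.append(f"{name}({arg}) -> {observation}")
    return "no answer within turn budget"


def response_synthesizer(traj: Trajectory, raw_answer: str) -> str:
    """Fold the trajectory into one readable response."""
    lines = [f"Question: {traj.question}", traj.guidance]
    lines += [f"Step {i}: {s}" for i, s in enumerate(traj.steps, 1)]
    lines.append(f"Answer: {raw_answer}")
    return "\n".join(lines)


def run_toolscope_sketch(question: str, tool_pool: Dict[str, Tool],
                         policy: Policy, max_turns: int = 4) -> str:
    toolkit, guidance = global_navigator(question, tool_pool)
    traj = Trajectory(question=question, guidance=guidance)
    raw = agentic_executor(traj, toolkit, policy, max_turns)
    return response_synthesizer(traj, raw)


# Stub tools and a scripted policy (both hypothetical) to exercise the loop.
DEMO_TOOLS: Dict[str, Tool] = {
    "Perceive": lambda arg: "3 apples visible",
    "Search": lambda arg: "no results",
    "Code": lambda arg: "ok",
}


def demo_policy(traj: Trajectory) -> Action:
    if not traj.steps:
        return ("call", "Perceive", "count the apples")
    return ("answer", "", "3")


if __name__ == "__main__":
    print(run_toolscope_sketch("How many apples are shown?", DEMO_TOOLS, demo_policy))
```

In this sketch the planner runs once up front, the executor owns the iterative tool loop, and the synthesizer only formats the recorded trajectory. A real system would replace `demo_policy` with calls to an MLLM and would pass image regions, not strings, to the Perceive tool.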

Paper Structure

This paper contains 23 sections, 4 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Illustration of the benefits of ToolScope for agentic tool-augmented reasoning in multimodal tasks. ToolScope enables MLLMs to zoom in on image details and retrieve external knowledge to augment reasoning.
  • Figure 2: Overview of ToolScope. It consists of three components: (a) Global Navigator selects a subset toolkit from the tool pool, and generates global guidance. (b) Agentic Executor works iteratively to think, execute tool invocation, and continue reasoning based on tools. (c) Response Synthesizer consolidates the reasoning trajectory into a user-friendly response.
  • Figure 3: Performance scaling with respect to the number of retrieved documents (top-k) using Qwen2.5-VL-7B. Results are shown on MAT-Search and ScienceQA. Increasing k allows access to more context but may introduce noise.
  • Figure 4: Scaling analysis of the maximum number of reasoning turns (max turns) using Qwen2.5-VL-7B on MathVista and VQA 2.0.
  • Figure 5: Scaling analysis of the size of backbone models using InternVL3 series on ScienceQA.
  • ...and 1 more figure