AdaTooler-V: Adaptive Tool-Use for Images and Videos

Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, Xiangyu Yue

TL;DR

AdaTooler-V tackles inefficient tool-use in multimodal reasoning by introducing AT-GRPO, a reinforcement learning approach that adapts tool invocation based on a per-sample Tool Benefit Score. The method combines a cold-start SFT phase with a subsequent RL phase, supported by two large datasets (AdaTooler-V-CoT-100k and AdaTooler-V-300k) covering images and videos. Empirical results across 12 benchmarks show state-of-the-art performance, with AdaTooler-V-7B achieving 89.8% on the high-resolution V* benchmark and surpassing GPT-4o, while reducing unnecessary tool-use and inference cost. The work provides a practical framework for efficient, adaptive tool-augmented multimodal reasoning and releases code, models, and data for reproducibility.

Abstract

Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Second, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, which outperforms existing methods on diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.

Paper Structure

This paper contains 33 sections, 5 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: (a) Compared with existing models that blindly invoke vision tools, AdaTooler-V adaptively invokes tools by determining whether the problem truly requires tools. (b) Distribution of $\Delta S$ values in the AdaTooler-V-300k dataset, where positive and negative values correspond to tool-helpful and tool-unhelpful samples. Here, $\Delta S$ is computed as the difference in average accuracy when Qwen2.5-VL-72B-Instruct (Bai et al., 2025) solves the same sample with and without tool-use.
  • Figure 2: Example reasoning trajectories of AdaTooler-V. For single-image and video questions, the model alternates between internal reasoning, vision tool invocations, and final answers, enabling zoom-in on fine-grained regions and inspection of informative clips. In contrast, for the multi-image clock example, AdaTooler-V solves the problem purely via text-based CoT, illustrating its ability to adaptively decide when vision tools are truly necessary.
  • Figure 3: The data distribution of our AdaTooler-V-300k dataset.
  • Figure 4: An illustration of our proposed AT-GRPO.
  • Figure 5: RL training curves.
  • ...and 3 more figures
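
The $\Delta S$ quantity in the Figure 1 caption can be illustrated with a small sketch. It assumes each sample has a list of 0/1 correctness outcomes from several attempts with tool-use and several without. The function names, the input format, and the "neutral" label for $\Delta S = 0$ are hypothetical choices made for illustration, not the released implementation.

```python
from statistics import mean


def tool_benefit_score(with_tool, without_tool):
    """Return Delta S: mean accuracy with tools minus mean accuracy without.

    Both arguments are sequences of 0/1 correctness outcomes for the same
    sample (assumed input format, not taken from the paper).
    """
    if not with_tool or not without_tool:
        raise ValueError("need at least one outcome per condition")
    return mean(with_tool) - mean(without_tool)


def benefit_label(delta_s):
    """Map Delta S to the categories named in the Figure 1 caption.

    The caption only covers positive and negative values; the "neutral"
    label for exactly zero is an assumption.
    """
    if delta_s > 0:
        return "tool-helpful"
    if delta_s < 0:
        return "tool-unhelpful"
    return "neutral"


# Example: 3 of 4 attempts correct with tools, 1 of 4 without.
delta_s = tool_benefit_score([1, 1, 0, 1], [0, 1, 0, 0])
print(delta_s, benefit_label(delta_s))
```

Under these assumptions, a sample solved more often with tools gets a positive score and a sample solved more often without tools gets a negative one, matching the tool-helpful and tool-unhelpful split described in the caption.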