
Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage

Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaojian Ma, Tao Yuan, Yue Fan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li

TL;DR

The paper addresses robust tool usage in multi-modal tasks by training vision-language models (VLMs) as controllers on large-scale, automatically generated multi-modal trajectories (MM-Traj). It introduces a three-stage data synthesis pipeline (query, file, and trajectory generation) with verifiers, yielding 20K high-quality tasks. The T3-Agent, a VLM-driven agent within the ReAct framework, is trained via Trajectory Tuning on MM-Traj and shows substantial performance gains on the GTA and GAIA benchmarks compared to untrained VLMs and several baselines. The work highlights the value of data-centric tuning for multi-modal tool reasoning and presents MM-Traj as a resource for advancing VLM-based tool usage. One limitation is the reliance on query-centric multi-modal data; the authors plan to incorporate trajectory-level multi-modal information in future work.

Abstract

The advancement of large language models (LLMs) has prompted the development of multi-modal agents, in which an LLM serves as a controller that calls external tools, providing a feasible way to solve practical tasks. In this paper, we propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data and tunes a vision-language model (VLM) as the controller for powerful tool-usage reasoning. To preserve data quality, we prompt the GPT-4o mini model to generate queries, files, and trajectories, followed by query-file and trajectory verifiers. Based on this data synthesis pipeline, we collect the MM-Traj dataset, which contains 20K tasks with tool-usage trajectories. We then develop the T3-Agent via Trajectory Tuning on VLMs for Tool usage using MM-Traj. Evaluations on the GTA and GAIA benchmarks show that the T3-Agent consistently achieves improvements on two popular VLMs, MiniCPM-V-8.5B and Qwen2-VL-7B, outperforming the untrained VLMs by 20%. These results demonstrate that the proposed data synthesis pipeline produces high-quality data for tool-usage capabilities.
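The generate-then-verify flow described in the abstract can be sketched as a small filtering loop. This is an illustrative sketch only, not the authors' implementation: the `Task` class, the `synthesize` function, and all five callables are hypothetical names, and the ordering (query-file check before trajectory generation) is an assumption. In the paper the generators and verifiers are prompts to GPT-4o mini; here they are plain Python callables so the control flow can be run on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """One synthesized sample (hypothetical container, not from the paper)."""
    query: str
    files: List[str]
    trajectory: List[str]


def synthesize(
    n_candidates: int,
    gen_query: Callable[[int], str],
    gen_files: Callable[[str], List[str]],
    gen_trajectory: Callable[[str, List[str]], List[str]],
    verify_query_files: Callable[[str, List[str]], bool],
    verify_trajectory: Callable[[str, List[str], List[str]], bool],
) -> List[Task]:
    """Generate candidates and keep only those that pass both verifiers."""
    kept: List[Task] = []
    for i in range(n_candidates):
        query = gen_query(i)                      # stage 1: query generation
        files = gen_files(query)                  # stage 2: file generation
        if not verify_query_files(query, files):  # assumed: checked before stage 3
            continue
        trajectory = gen_trajectory(query, files)  # stage 3: trajectory generation
        if not verify_trajectory(query, files, trajectory):
            continue
        kept.append(Task(query, files, trajectory))
    return kept
```

Passing the generators and verifiers in as arguments keeps the loop independent of any particular model API; in practice each callable would wrap a model call and its prompt.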

Paper Structure

This paper contains 46 sections, 2 equations, 24 figures, 9 tables.

Figures (24)

  • Figure 1: The comparison of the LLM (GPT-4)-driven agent and our T3-Agent. Our agent chooses more precise tools based on the given files and intermediate observations.
  • Figure 2: The pipeline for data generation.
  • Figure 3: Data statistics on the MM-Traj dataset.
  • Figure 4: Case study of the T3-Agent in the GTA benchmark.
  • Figure 5: Case study of the T3-Agent in the GAIA benchmark.
  • ...and 19 more figures