AudioToolAgent: An Agentic Framework for Audio-Language Models

Gijs Wijngaard, Elia Formisano, Michel Dumontier

TL;DR

AudioToolAgent addresses the gap between reasoning-enabled LLMs and specialized audio tools by introducing a modular framework where a text-only central agent coordinates diverse audio-language models as tools. It leverages tool adapters and structured <tool_call> interactions to perform iterative questioning, tool usage, and cross-verification, all without additional training data. Empirical results on MMAU, MMAR, and MMAU-Pro show state-of-the-art performance, with Shapley-value analyses highlighting the most impactful tools and agents. The approach offers a cost-efficient, extensible path for robust multimodal audio understanding and reasoning, suitable for future expansion to broader tasks and tool ecosystems.

Abstract

Large Audio-Language Models (LALMs) perform well on audio understanding tasks but lack the multi-step reasoning and tool calling found in recent Large Language Models (LLMs). This paper presents AudioToolAgent, a framework that coordinates audio-language models as tools via a central LLM agent that accesses tool adapters for audio question answering and speech-to-text. The agent selects tools, asks follow-up questions, and compares outputs for verification. Experiments on MMAU, MMAR, and MMAU-Pro show state-of-the-art accuracy: up to 74.10% on MMAU, 68.80% on MMAR, and 57.96% on MMAU-Pro. Monte Carlo sampling of Shapley values across 374 configurations identifies effective agent-tool combinations. The modular design allows integration of new tools and requires no additional training data or training cost. Code and reproduction materials are available at: github.com/GLJS/AudioToolAgent
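
The coordination loop described above (a text-only agent emits structured <tool_call> requests, tool adapters answer them, and the agent compares the answers) can be illustrated with a minimal sketch. This is not the authors' implementation: the JSON payload inside <tool_call>, the field names `name` and `question`, the `ToolAdapter` signature, and the `run_agent` helper are all assumptions made for illustration.

```python
import json
import re
from typing import Callable, Dict, List

# Hypothetical adapter signature: (audio_path, question) -> text answer.
ToolAdapter = Callable[[str, str], str]

# Assumed wire format: a JSON object wrapped in <tool_call>...</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(agent_output: str) -> List[dict]:
    """Extract well-formed JSON objects from <tool_call> spans."""
    calls = []
    for payload in TOOL_CALL_RE.findall(agent_output):
        try:
            parsed = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed calls instead of crashing
        if isinstance(parsed, dict):
            calls.append(parsed)
    return calls


def run_agent(
    agent: Callable[[List[str]], str],
    tools: Dict[str, ToolAdapter],
    audio_path: str,
    question: str,
    max_turns: int = 5,
) -> str:
    """Alternate between the text-only agent and its audio tools.

    The agent never sees the audio; it only sees the transcript of
    tool answers, which it can cross-check before answering.
    """
    transcript = [f"Question: {question}"]
    for _ in range(max_turns):
        output = agent(transcript)
        calls = parse_tool_calls(output)
        if not calls:
            return output  # no tool calls: treat the output as the final answer
        for call in calls:
            name = call.get("name")
            tool = tools.get(name)
            if tool is None:
                transcript.append(f"[{name}] error: unknown tool")
                continue
            # Fall back to the original question if no follow-up is given.
            result = tool(audio_path, call.get("question", question))
            transcript.append(f"[{name}] {result}")
    return transcript[-1]  # turn budget exhausted: return the last tool output
```

Under these assumptions, adding a new tool only requires registering another adapter in the `tools` dictionary, which mirrors the modularity the abstract claims.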

Paper Structure

This paper contains 12 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Example of a chatbot using the AudioToolAgent framework. The agent is a large language model that coordinates the tools; the tools are audio-language models.
  • Figure 2: Schematic overview of the AudioToolAgent framework. The framework has two components: a central agent and a set of tool models. The agent is a large language model that coordinates the tools; the tools are audio-language models.
  • Figure 3: Per-model accuracies (dots; 5 runs) and mean accuracy (horizontal bar) on a subset of the MMAU test-mini split. Vertical black ticks denote the corresponding ALM without any tools.
  • Figure 4: Shapley values per feature with standard error bars computed over 374 runs. Bars indicate the contribution of the audio-language model or API as tool to the overall system performance. Error bars are black.
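
For readers unfamiliar with the Shapley analysis behind Figure 4, the following is a minimal sketch of Monte Carlo Shapley estimation by permutation sampling. It is a generic illustration, not the paper's exact procedure: the `value` function (which would map a set of tools to an accuracy score), the function name, and the parameters are assumptions.

```python
import random
from typing import Callable, Dict, FrozenSet, Sequence


def monte_carlo_shapley(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
    num_permutations: int = 200,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled orderings of the players (here: tools)."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition: FrozenSet[str] = frozenset()
        prev = value(coalition)
        for p in order:
            # Marginal gain from adding p to the players that precede it.
            coalition = coalition | {p}
            cur = value(coalition)
            totals[p] += cur - prev
            prev = cur
    return {p: t / num_permutations for p, t in totals.items()}
```

In practice `value` would be expensive, since each call corresponds to evaluating one agent-tool configuration on a benchmark, so sampling a limited number of configurations instead of enumerating all subsets is what keeps the analysis tractable.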