Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning

Kuan-Yi Lee, Tsung-En Lin, Hung-Yi Lee

TL;DR

The paper addresses the limitation of end-to-end audio-language models in performing structured, low-level audio reasoning. It proposes Audio-Maestro, a two-phase, tool-augmented framework that lets a Large Audio-Language Model autonomously call external audio tools and integrate their timestamped outputs into its reasoning. Empirical results on the MMAU benchmark show consistent accuracy improvements across multiple base models, and the authors provide detailed analyses of audio versus tool contributions and error sources. The work advances interpretability and grounding in multi-domain audio reasoning and points to tool reliability and latency as important directions for future improvement.

Abstract

Recent advancements in large multimodal models (LMMs) have shown strong capabilities in audio understanding. However, most systems rely solely on end-to-end reasoning, limiting interpretability and accuracy for tasks that require structured knowledge or specialized signal analysis. In this work, we present Audio-Maestro -- a tool-augmented audio reasoning framework that enables audio-language models to autonomously call external tools and integrate their timestamped outputs into the reasoning process. This design allows the model to analyze, transform, and interpret audio signals through specialized tools rather than relying solely on end-to-end inference. Experiments show that Audio-Maestro consistently improves general audio reasoning performance: Gemini-2.5-flash's average accuracy on MMAU-Test rises from 67.4% to 72.1%, DeSTA-2.5 from 58.3% to 62.8%, and GPT-4o from 60.8% to 63.9%. To our knowledge, Audio-Maestro is the first framework to integrate structured tool output into the large audio language model reasoning process.

Paper Structure

This paper contains 25 sections, 4 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Overview of the Audio-Maestro framework. Given an audio input, query, and toolkit, the LALM first decides whether to answer directly or call tools in Phase 1. In Phase 2, selected tools are executed on the audio, producing structured, timestamped outputs that are integrated with the query and audio for final inference.
  • Figure 2: Performance of Gemini-2.5-flash, DeSTA-2.5, and GPT-4o on the MMAU Benchmark. The results are segmented into eight categories; T denotes MMAU-Test and Tm denotes MMAU-Test-Mini.
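
The two-phase flow described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration of the control flow only, not the authors' implementation: the names `ToolOutput`, `Phase1Decision`, `format_tool_outputs`, and `audio_maestro`, the `decide` and `infer` callables standing in for the LALM, and the text format used for timestamped outputs are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ToolOutput:
    """One timestamped result from a tool (hypothetical structure)."""
    tool: str
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    content: str


@dataclass
class Phase1Decision:
    """Phase 1 result: either a direct answer or a list of tools to call."""
    answer: Optional[str]
    tool_names: List[str]


# Hypothetical tool signature: audio path in, timestamped outputs out.
Tool = Callable[[str], List[ToolOutput]]


def format_tool_outputs(outputs: List[ToolOutput]) -> str:
    """Serialize tool outputs in time order (assumed text format)."""
    ordered = sorted(outputs, key=lambda o: o.start)
    return "\n".join(
        f"[{o.start:.2f}s-{o.end:.2f}s] {o.tool}: {o.content}" for o in ordered
    )


def audio_maestro(
    audio: str,
    query: str,
    toolkit: Dict[str, Tool],
    decide: Callable[[str, str, List[str]], Phase1Decision],
    infer: Callable[[str, str, str], str],
) -> str:
    # Phase 1: the model decides whether to answer directly or call tools.
    decision = decide(audio, query, list(toolkit))
    if decision.answer is not None:
        return decision.answer

    # Phase 2: run the selected tools on the audio, skipping unknown names.
    outputs: List[ToolOutput] = []
    for name in decision.tool_names:
        if name in toolkit:
            outputs.extend(toolkit[name](audio))

    # Final inference sees the audio, the query, and the tool outputs.
    return infer(audio, query, format_tool_outputs(outputs))
```

In this sketch `decide` and `infer` would wrap calls to the underlying audio-language model, and each `toolkit` entry would wrap an external audio tool. Keeping them as plain callables lets the control flow be exercised with stubs before any model or tool is attached.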