AudioGenie-Reasoner: A Training-Free Multi-Agent Framework for Coarse-to-Fine Audio Deep Reasoning

Yan Rong, Chenxing Li, Dong Yu, Li Liu

TL;DR

This work tackles the gap between audio perception and deep reasoning by introducing AudioGenie-Reasoner (AGR), a training-free multi-agent system that converts audio into a coarse textual document and iteratively refines an evolving textual evidence chain using a diagnose-plan-act loop. By decoupling perception (handled by audio large language models, ALLMs) from reasoning (handled by text LLMs) and leveraging tool-augmented actions, AGR achieves state-of-the-art or competitive results on the MMAU-mini and MMAR benchmarks without audio-specific reasoning training. Key contributions include a unified training-free multi-agent system for audio deep reasoning, a coarse-to-fine text understanding framework, and performance gains that support the effectiveness of proactive iterative refinement. This approach reduces dependency on large, annotated audio reasoning datasets and highlights the practical potential of coupling perceptual strengths with advanced language-model reasoning for complex multi-step audio tasks.

Abstract

Audio deep reasoning is a challenging task that requires expert-level perception, multi-step logical inference, and the integration of contextual knowledge. However, existing models suffer from a gap between audio perception and reasoning abilities, caused by the lack of training data with explicit reasoning chains and the absence of mechanisms for active exploration and iterative refinement. To address these challenges, we propose AudioGenie-Reasoner (AGR), the first unified training-free multi-agent system that coordinates perception and reasoning over an evolving chain of textual evidence. Our key idea is a paradigm shift: we recast audio deep reasoning as a complex text understanding task, thereby unlocking the full potential of large language models. Specifically, the design of AGR mimics the human coarse-to-fine cognitive process. It first transforms the input audio into a coarse text-based document. Then, a novel proactive iterative document refinement loop, featuring tool-augmented routes and specialized agents, continuously searches for missing information and augments the evidence chain in a coarse-to-fine manner until sufficient question-related information has been gathered to make the final prediction. Experimental results show that AGR achieves state-of-the-art (SOTA) performance over existing open-source audio deep reasoning models across various benchmarks. The code will be available at https://github.com/ryysayhi/AudioGenie-Reasoner.
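
The coarse-to-fine refinement loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `EvidenceChain`, `refine_evidence`, `caption`, `diagnose`, `plan`, `act`, and `max_rounds` are all hypothetical, and the toy functions stand in for the ALLM, the LLM agents, and the tools that the real system would call. The sketch only shows the control flow: start from a coarse caption, check whether the evidence is sufficient, and otherwise pick a tool and append its output to the evidence chain.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class EvidenceChain:
    """Evolving textual evidence for one audio clip and one question (hypothetical type)."""
    question: str
    entries: List[str] = field(default_factory=list)

    def as_document(self) -> str:
        # The reasoning LLM would read the chain as a single text document.
        return "\n".join(self.entries)


def refine_evidence(
    audio: str,
    question: str,
    caption: Callable[[str], str],
    diagnose: Callable[[EvidenceChain], Optional[str]],
    plan: Callable[[str], str],
    act: Callable[[str, str], str],
    max_rounds: int = 3,
) -> EvidenceChain:
    """Run an assumed diagnose-plan-act loop over a coarse initial caption."""
    # Coarse stage: one pass over the audio yields the initial document.
    chain = EvidenceChain(question, [caption(audio)])
    for _ in range(max_rounds):
        gap = diagnose(chain)          # what information is still missing?
        if gap is None:                # evidence judged sufficient: stop refining
            break
        tool = plan(gap)               # choose a tool-augmented route for the gap
        chain.entries.append(act(audio, tool))  # fine stage: add new evidence
    return chain


# Toy stand-ins (assumptions for illustration only; no real models are called).
def toy_caption(audio: str) -> str:
    return "coarse: a person speaks over background music"


def toy_diagnose(chain: EvidenceChain) -> Optional[str]:
    if "transcript" in chain.as_document():
        return None
    return "spoken content unknown"


def toy_plan(gap: str) -> str:
    return "speech_recognition"


def toy_act(audio: str, tool: str) -> str:
    return f"{tool} transcript: 'the train leaves at nine'"


chain = refine_evidence(
    "clip.wav", "When does the train leave?",
    toy_caption, toy_diagnose, toy_plan, toy_act,
)
print(chain.as_document())
```

In this toy run the loop performs one refinement round and then stops, because the diagnosis step finds the transcript it was missing. The `max_rounds` cap is an assumed safeguard against a diagnosis step that never reports sufficiency.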

Paper Structure

This paper contains 9 sections, 5 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Performance comparison of AudioGenie-Reasoner with other audio reasoning models. Our framework excels in providing correct answers and valid reasoning.
  • Figure 2: The multi-agent architecture of AudioGenie-Reasoner. Specialized agents for planning, interaction, and augmentation collaborate within an iterative loop to refine a coarse initial caption into an evolving textual evidence chain.