AudioRouter: Data Efficient Audio Understanding via RL based Dual Reasoning

Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang

TL;DR

AudioRouter introduces a data-efficient reinforcement learning framework that decouples tool-use decisions from core audio reasoning by pairing a lightweight Router with a frozen Reasoner. The Router is trained with a relative outcome reward to decide when invoking external audio tools improves the final prediction, achieving $25\times$–$647\times$ data savings while delivering state-of-the-art results on MMAU-mini and MMAR. This approach demonstrates that learning how to use external perceptual tools can be more scalable and practical than end-to-end perceptual internalization for LALMs. The framework offers a principled pathway to robust, tool-augmented audio understanding with broad potential across modalities.

Abstract

Large Audio Language Models (LALMs) have demonstrated strong capabilities in audio understanding and reasoning. However, their performance on fine-grained auditory perception remains unreliable, and existing approaches largely rely on data-intensive training to internalize perceptual abilities. We propose AudioRouter, a reinforcement learning framework that enables LALMs to improve audio understanding by learning when and how to use external audio tools. Rather than tightly coupling tool usage with audio reasoning, AudioRouter formulates tool use as an explicit decision-making problem and optimizes a lightweight routing policy while keeping the underlying reasoning model frozen. Experimental results show that AudioRouter achieves substantial improvements on standard audio understanding benchmarks while requiring up to 600x less training data to learn tool usage than conventional training paradigms. These findings suggest that learning effective tool usage offers a data-efficient and scalable alternative to internalizing perceptual abilities in LALMs.
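
The decoupling described in the abstract can be illustrated with a minimal sketch, assuming a Router that returns either a tool name or nothing, and a frozen Reasoner that accepts optional tool output. All names here (RoutingDecision, run_pipeline, the stub router, reasoner, and tool table) are hypothetical and chosen for illustration; they are not the paper's implementation or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class RoutingDecision:
    # Hypothetical container: None means "answer directly, no tool".
    tool_name: Optional[str]


def run_pipeline(
    question: str,
    audio: bytes,
    router: Callable[[str], RoutingDecision],
    reasoner: Callable[[str, bytes, Optional[str]], str],
    tools: Dict[str, Callable[[bytes], str]],
) -> str:
    """Route first, then reason. Only the router would be trained;
    the reasoner is treated as a fixed black box."""
    decision = router(question)
    tool_output = None
    if decision.tool_name is not None:
        tool_output = tools[decision.tool_name](audio)
    return reasoner(question, audio, tool_output)


# Stubs standing in for learned or external components (assumptions).
ROUTES = {"What was said?": "transcriber"}
TOOLS = {"transcriber": lambda audio: "transcript"}


def stub_router(question: str) -> RoutingDecision:
    return RoutingDecision(ROUTES.get(question))


def stub_reasoner(question: str, audio: bytes, tool_output: Optional[str]) -> str:
    return f"answer using {tool_output}" if tool_output else "direct answer"
```

In this sketch the Reasoner never changes; swapping in a better routing policy only means replacing the router callable, which is the separation the abstract argues makes training data-efficient.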

Paper Structure (34 sections, 7 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Two pervasive failure modes in tool-use systems and our capability-aware routing framework. Top: Surface-Level Keyword Bias selects tools based on semantic keyword overlap despite functional mismatch, producing valid but irrelevant outputs (left). Hallucination of Tool Capability Boundaries occurs when models extrapolate tool outputs beyond their explicit capability scope (right). Bottom: Our AudioRouter models tool use as a capability-aware routing problem, prioritizing functional validity over surface cues and restricting inference within tool capability boundaries via relative reward.
  • Figure 2: Overview of the proposed AudioRouter framework. A Router first decides whether to invoke a tool based on capability-aware routing, followed by a fixed Reasoner for audio reasoning. The Router is optimized via a relative outcome reward that compares tool-augmented and direct reasoning results, encouraging beneficial tool usage while suppressing redundant or harmful tool calls.
  • Figure 3: Relative outcome reward for training the Router policy. Tool-augmented decisions are rewarded or penalized based on their correctness relative to direct reasoning, encouraging boundary-breaking tool usage while suppressing redundant or harmful tool calls.
  • Figure 4: Main results on MMAU-mini and MMAR. Left: Post-training data scale comparison. AudioRouter achieves $25\times$–$647\times$ data savings. Right: Performance comparison under Qwen2.5-Omni and Qwen2-Audio backbones, where AudioRouter consistently improves accuracy over end-to-end baselines and remains competitive with closed-source models.
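
The relative outcome reward in the Figure 3 caption can be sketched as a comparison between the tool-augmented answer and the direct answer. The captions give only the qualitative behavior (reward boundary-breaking calls, suppress redundant or harmful ones), so the function name, the four-case split, and every numeric value below are assumptions for illustration, not the paper's actual reward.

```python
def relative_outcome_reward(
    tool_correct: bool,
    direct_correct: bool,
    gain: float = 1.0,        # assumed value: tool fixes a wrong direct answer
    redundant: float = -0.1,  # assumed value: tool call was unnecessary
    harm: float = -1.0,       # assumed value: tool breaks a correct direct answer
    futile: float = -0.1,     # assumed value: neither path is correct
) -> float:
    """Score a tool-call decision relative to direct reasoning."""
    if tool_correct and not direct_correct:
        return gain       # boundary-breaking: the tool enabled a correct answer
    if tool_correct and direct_correct:
        return redundant  # redundant: direct reasoning already succeeded
    if direct_correct:
        return harm       # harmful: the tool output misled the Reasoner
    return futile         # the tool did not help


# Example: the tool rescues a question the Reasoner alone gets wrong.
example = relative_outcome_reward(tool_correct=True, direct_correct=False)
```

Because the reward depends on the difference between the two outcomes rather than on tool-augmented correctness alone, a policy trained on it has no incentive to call tools on questions the frozen Reasoner already answers correctly.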