Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning
Daiqing Wu, Xuan Zhang, Dongbao Yang, Jiashu Yao, Longfei Chen, Qingsong Liu, Sicheng Zhao, Can Ma, Yangyang Kang, Yu Zhou
TL;DR
This work addresses the information bottleneck that arises when large audio language models (LALMs) encode audio only once as context. It introduces audio-interleaved reasoning, in which the model actively consults audio content during reasoning. Echo, a 7B LALM, is developed through a two-stage training regime (SFT for audio-grounded CoTs, then RL for adaptive re-listening) paired with a structured data-generation pipeline that yields the EAQA-SFT and EAQA-RL datasets. Empirical results on the MMAR, MMAU-mini, and MMAU benchmarks show Echo achieving superior or highly competitive performance, highlighting the value of sustained audio engagement and precise segment localization for expert-level audio comprehension. The paper also analyzes reward design, data quality, and training dynamics, and discusses future refinements and ethical considerations for scalable, responsible audio reasoning research.
Abstract
The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a critical information bottleneck. Drawing inspiration from human cognition, we propose audio-interleaved reasoning to break through this bottleneck. It treats audio as an active reasoning component, enabling sustained audio engagement and perception-grounded analysis. To instantiate it, we introduce a two-stage training framework, first teaching LALMs to localize salient audio segments through supervised fine-tuning, and then incentivizing proficient re-listening via reinforcement learning. In parallel, a structured data generation pipeline is developed to produce high-quality training data. Consequently, we present Echo, a LALM capable of dynamically re-listening to audio on demand during reasoning. On audio comprehension benchmarks, Echo achieves overall superiority in both challenging expert-level and general-purpose tasks. Comprehensive analysis further confirms the efficiency and generalizability of audio-interleaved reasoning, establishing it as a promising direction for advancing audio comprehension. Project page: https://github.com/wdqqdw/Echo.
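
The re-listening idea described in the abstract can be illustrated with a minimal control-loop sketch. This is not Echo's implementation: the `<seg>start, end</seg>` tag format, the `generate_step` callback, and the `max_relistens` cap are all assumptions introduced here for illustration, since this section does not specify how segments are marked or fed back to the model.

```python
import re
from typing import Callable, List, Sequence, Tuple

# Hypothetical markup for "re-listen to this time span" (seconds).
# The actual format used by Echo is not given in this section.
SEGMENT_TAG = re.compile(r"<seg>\s*([0-9.]+)\s*,\s*([0-9.]+)\s*</seg>")

Context = List[Tuple[str, object]]  # ("audio", samples) or ("text", string)


def slice_audio(samples: Sequence[float], sample_rate: int,
                start_s: float, end_s: float) -> Sequence[float]:
    """Return the samples between start_s and end_s, clamped to the clip."""
    lo = max(0, int(start_s * sample_rate))
    hi = min(len(samples), int(end_s * sample_rate))
    return samples[lo:hi]


def interleaved_reasoning(samples: Sequence[float], sample_rate: int,
                          generate_step: Callable[[Context], str],
                          max_relistens: int = 4) -> Tuple[List[str], Context]:
    """Alternate between text generation and re-listening to audio segments.

    generate_step is a stand-in for the model: it receives the running
    context and returns the next chunk of reasoning text.
    """
    context: Context = [("audio", samples)]  # initial one-time encoding
    trace: List[str] = []
    for _ in range(max_relistens + 1):
        text = generate_step(context)
        trace.append(text)
        context.append(("text", text))
        match = SEGMENT_TAG.search(text)
        if match is None:
            break  # no re-listen requested: treat this chunk as final
        start_s, end_s = float(match.group(1)), float(match.group(2))
        if end_s <= start_s:
            break  # malformed span: stop rather than guess
        # Append the requested segment so the next step can attend to it.
        context.append(("audio", slice_audio(samples, sample_rate, start_s, end_s)))
    return trace, context
```

In this sketch the context grows as an alternating sequence of text and audio entries, in contrast to a one-time encoding where audio appears only once at the start.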
