Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning

Daiqing Wu, Xuan Zhang, Dongbao Yang, Jiashu Yao, Longfei Chen, Qingsong Liu, Sicheng Zhao, Can Ma, Yangyang Kang, Yu Zhou

TL;DR

This work addresses the bottleneck of standard audio reasoning in large audio language models by introducing audio-interleaved reasoning, where audio content is actively consulted during reasoning rather than merely encoded as context. Echo, a 7B LALM, is developed through a two-stage training regime (SFT for audio-grounded CoTs and RL for adaptive re-listening) paired with a structured data-generation pipeline that yields EAQA-SFT/EAQA-RL datasets. Empirical results on MMAR, MMAU-mini, and MMAU benchmarks show Echo achieving superior or highly competitive performance, highlighting the value of sustained audio engagement and precise segment localization for expert-level audio comprehension. The paper analyzes reward design, data quality, and training dynamics, and discusses future refinements and ethical considerations for scalable, responsible audio reasoning research.

Abstract

The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a critical information bottleneck. Drawing inspiration from human cognition, we propose audio-interleaved reasoning to break through this bottleneck. It treats audio as an active reasoning component, enabling sustained audio engagement and perception-grounded analysis. To instantiate it, we introduce a two-stage training framework, first teaching LALMs to localize salient audio segments through supervised fine-tuning, and then incentivizing proficient re-listening via reinforcement learning. In parallel, a structured data generation pipeline is developed to produce high-quality training data. Consequently, we present Echo, a LALM capable of dynamically re-listening to audio on demand during reasoning. On audio comprehension benchmarks, Echo achieves overall superiority in both challenging expert-level and general-purpose tasks. Comprehensive analysis further confirms the efficiency and generalizability of audio-interleaved reasoning, establishing it as a promising direction for advancing audio comprehension. Project page: https://github.com/wdqqdw/Echo.

Paper Structure

This paper contains 42 sections, 6 equations, 18 figures, 15 tables, 1 algorithm.

Figures (18)

  • Figure 1: Comparison between audio-conditioned text reasoning and audio-interleaved reasoning. Panels (a) and (b) compare the attention the LALM allocates to prompt, response, and audio tokens during reasoning, averaged over 100 samples. Switching from audio-conditioned text reasoning to audio-interleaved reasoning makes the LALM place significantly more attention on the audio tokens ($\Delta$ = +140%), leading to more meaningful and traceable audio analysis.
  • Figure 2: Summarized illustration of the training framework. The base model (a) is first enabled to localize and reference audio segments via SFT (b). The obtained cold-start model (c) is then equipped with audio-interleaved reasoning via inference adaptation (d): the inference process is paused whenever segment tags are encountered, and the corresponding raw audio segments are inserted afterwards before resuming. Subsequently, RL (e) is applied to further endow the model with competence in flexible audio invocation and accurate responding.
  • Figure 3: Overview of the data generation pipeline. It begins with an audio dataset containing fine-grained temporal metadata. For each audio clip, Qwen2.5-Omni is employed to derive a structured caption encompassing comprehensive descriptions, speech content, and musical elements. This information, along with the temporal metadata, is then fed into DeepSeek-R1 to synthesize QA-CoT triplets. The synthesized triplets are further filtered based on the quality of the Audio-QA and CoT, and are subsequently appended to two separate datasets for SFT and RL.
  • Figure 4: Evolution of (a, b) reward components, (c, d, e) segment-associated statistics, and (f) model divergence during the RL process. Panel (e) evaluates the temporal overlap among segments within each response. Panel (f) measures the distribution discrepancy between the policy and the reference model.
  • Figure 5: Progression of fine-grained cognitive abilities from the base model, to the cold-start model, and finally to Echo. The evaluation selects tasks from 10 representative skills, annotated in MMAU-mini as core requirements for LALMs, encompassing reasoning over speech, music, and sound.
  • ...and 13 more figures
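
The inference adaptation described in the Figure 2 caption (pause generation when a segment tag appears, insert the corresponding raw audio, then resume) can be sketched roughly as below. This is a minimal illustration and not the authors' implementation: the tag format `<seg>start,end</seg>`, the `step` callable standing in for the model, and the helper names `slice_audio` and `interleaved_generate` are all hypothetical, and the sketch assumes that `step` stops right after emitting a tag.

```python
import re
from typing import Callable, List, Sequence, Tuple

# Hypothetical tag format (an assumption, not taken from the paper):
# <seg>start,end</seg>, with both times given in seconds.
SEG_TAG = re.compile(r"<seg>(\d+(?:\.\d+)?),(\d+(?:\.\d+)?)</seg>")

# The context is an ordered list of ("text", str) and ("audio", samples) items.
Context = List[Tuple[str, object]]


def slice_audio(waveform: Sequence[float], sample_rate: int,
                start_s: float, end_s: float) -> List[float]:
    """Return the raw samples between start_s and end_s, clamped to the clip."""
    lo = max(0, int(start_s * sample_rate))
    hi = min(len(waveform), int(end_s * sample_rate))
    return list(waveform[lo:hi])


def interleaved_generate(step: Callable[[Context], str],
                         waveform: Sequence[float], sample_rate: int,
                         prompt: str, max_steps: int = 8) -> Context:
    """Alternate text generation with re-insertion of raw audio segments.

    `step(context)` is a stand-in for the model: it returns the next chunk
    of text, ending either right after a segment tag or at the final answer.
    """
    context: Context = [("text", prompt)]
    for _ in range(max_steps):
        chunk = step(context)
        context.append(("text", chunk))
        match = SEG_TAG.search(chunk)
        if match is None:
            break  # no segment tag, so the model has finished answering
        start_s, end_s = float(match.group(1)), float(match.group(2))
        # Pause point: append the referenced audio before generation resumes.
        segment = slice_audio(waveform, sample_rate, start_s, end_s)
        context.append(("audio", segment))
    return context
```

In a real system the `("audio", ...)` items would be passed through the audio encoder rather than kept as raw sample lists; the list-of-tuples context here only makes the pause, insert, and resume ordering visible.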