When Scaling Fails: Mitigating Audio Perception Decay of LALMs via Multi-Step Perception-Aware Reasoning

Ruixiang Mao, Xiangnan Ma, Dan Chen, Ziming Zhu, Yuan Ge, Aokai Hao, Haishu Zhao, Yifu Huo, Qing Yang, Kaiyan Chang, Xiaoqian Liu, Chenglong Wang, Qiaozhi He, Tong Xiao, Jingbo Zhu

TL;DR

Leveraging reinforcement learning, MPAR$^2$ improves perception performance on CAFE from 31.74% to 63.51% and effectively mitigates perception decay, concurrently enhancing reasoning capabilities to achieve a significant 74.59% accuracy on the MMAU benchmark.

Abstract

Test-Time Scaling has shown notable efficacy in addressing complex problems by scaling inference compute. However, within Large Audio-Language Models (LALMs), a counterintuitive phenomenon exists: post-training models on structured reasoning trajectories yields marginal or even negative gains compared to post-training for direct answering. To investigate this, we introduce CAFE, an evaluation framework designed to precisely quantify audio reasoning errors. Evaluation results reveal that LALMs struggle with perception during reasoning and encounter a critical bottleneck: reasoning performance suffers from audio perception decay as reasoning length grows. To address this, we propose MPAR$^2$, a paradigm that encourages dynamic perceptual reasoning and decomposes complex questions into perception-rich sub-problems. Leveraging reinforcement learning, MPAR$^2$ improves perception performance on CAFE from 31.74% to 63.51% and effectively mitigates perception decay, concurrently enhancing reasoning capabilities to achieve a significant 74.59% accuracy on the MMAU benchmark. Further analysis demonstrates that MPAR$^2$ reinforces LALMs' attention to audio input and dynamically adapts the reasoning budget to match task complexity.

Paper Structure (37 sections, 8 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Comparison of direct-answer RL baselines (denoted by RL) and RL with explicit reasoning prompts (denoted by RL$^\dagger$). Three RL variants (R1-AQA, Audio-Thinker, and SARI) are tested on the MMAU benchmark. Here, the suffixes (A) and (O) indicate Qwen2-Audio and Qwen2.5-Omni as the base model, respectively. More details of the prompts used here are presented in Appendix \ref{sec:prompts_rl}.
  • Figure 2: Overview of the CAFE framework evaluation pipeline. First, the audio model generates a reasoning process based on the audio-QA input. Meanwhile, the audio is captioned by Gemini-3-Pro. Finally, Gemini-3-Pro uses the inputs, the reasoning process, and the captions to extract different audio events.
  • Figure 3: Reasoning accuracy (accuracy on the evaluated benchmarks) and CAFE's perception accuracy across reasoning token lengths. Reasoning accuracy aligns closely with perception accuracy. Specifically, in panels (a)-(d), both metrics decline as the number of reasoning tokens increases, which we term Audio Perception Decay. The x-axis represents the mean of each token-length interval, and the y-axis indicates the accuracy.
  • Figure 4: An illustration of the MPAR$^2$ training pipeline. The top of the figure outlines the overall GRPO training setup, whereas the lower section provides a detailed view of the reward design. The cold-start policy model is derived from the Stage 1 SFT-trained model.
  • Figure 5: Accuracy of MPAR$^2$-7B vs. Base across token lengths.
  • ...and 2 more figures
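
The accuracy-versus-token-length curves described for Figure 3 can be illustrated with a small sketch. It groups responses into token-length intervals and reports, for each interval, the mean token length (the x-axis) and the accuracy (the y-axis). The function name `accuracy_by_token_length`, the interval edges, and the toy records are hypothetical and chosen only for illustration; they are not the paper's code, data, or binning.

```python
from statistics import mean


def accuracy_by_token_length(records, edges):
    """Group (token_length, is_correct) records into [lo, hi) intervals.

    Returns one (mean_token_length, accuracy, count) tuple per non-empty
    interval, in order of increasing token length.
    """
    buckets = [[] for _ in range(len(edges) - 1)]
    for length, correct in records:
        for i in range(len(edges) - 1):
            if edges[i] <= length < edges[i + 1]:
                buckets[i].append((length, correct))
                break  # each record belongs to at most one interval

    points = []
    for bucket in buckets:
        if not bucket:
            continue  # skip empty intervals instead of dividing by zero
        x = mean(length for length, _ in bucket)  # x-axis: mean length in interval
        y = sum(1 for _, ok in bucket if ok) / len(bucket)  # y-axis: accuracy
        points.append((x, y, len(bucket)))
    return points


# Hypothetical toy data: (reasoning token count, answer was correct).
records = [
    (100, True), (150, True),
    (300, True), (350, False),
    (600, False), (700, False), (650, True),
]
edges = [0, 200, 400, 800]  # assumed interval boundaries
points = accuracy_by_token_length(records, edges)
```

The same grouping applies whether `is_correct` records benchmark correctness (reasoning accuracy) or a perception judgment. In this toy data the accuracy falls as the intervals get longer, which is the pattern the caption names Audio Perception Decay.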