When Scaling Fails: Mitigating Audio Perception Decay of LALMs via Multi-Step Perception-Aware Reasoning

Ruixiang Mao; Xiangnan Ma; Dan Chen; Ziming Zhu; Yuan Ge; Aokai Hao; Haishu Zhao; Yifu Huo; Qing Yang; Kaiyan Chang; Xiaoqian Liu; Chenglong Wang; Qiaozhi He; Tong Xiao; Jingbo Zhu

TL;DR

Leveraging reinforcement learning, MPAR$^2$ improves perception performance on CAFE from 31.74% to 63.51% and effectively mitigates perception decay, while also enhancing reasoning capabilities to reach 74.59% accuracy on the MMAU benchmark.

Abstract

Test-Time Scaling has proven effective at solving complex problems by scaling inference compute. However, Large Audio-Language Models (LALMs) exhibit a counterintuitive phenomenon: post-training models to produce structured reasoning trajectories yields marginal or even negative gains compared with post-training for direct answering. To investigate this, we introduce CAFE, an evaluation framework designed to precisely quantify audio reasoning errors. Evaluation results reveal that LALMs struggle with perception during reasoning and face a critical bottleneck: reasoning performance suffers from audio perception decay as reasoning length grows. To address this, we propose MPAR$^2$, a paradigm that encourages dynamic perceptual reasoning and decomposes complex questions into perception-rich sub-problems. Leveraging reinforcement learning, MPAR$^2$ improves perception performance on CAFE from 31.74% to 63.51% and effectively mitigates perception decay, while also enhancing reasoning capabilities to reach 74.59% accuracy on the MMAU benchmark. Further analysis demonstrates that MPAR$^2$ reinforces LALMs' attention to audio input and dynamically adapts the reasoning budget to match task complexity.

Paper Structure (37 sections, 8 equations, 7 figures, 2 tables)

Introduction
Related Work
Large Audio Language Models
Large Audio Reasoning Models
Adaptive Reasoning and Difficulty Awareness
Reasoning-Time Audio Event Perception and Utilization Probing
Task Formulation
Overview
Metric Formulation
Framework details
Evaluation
Experimental Setup
Evaluation Result
Weak Perception during Reasoning
Audio Perception Decay Across Extended Reasoning
...and 22 more sections

Figures (7)

Figure 1: Comparison of direct-answer RL baselines (denoted by RL) and RL with explicit reasoning prompts (denoted by RL$^\dagger$). Three RL variants (R1-AQA, Audio-Thinker, and SARI) are tested on the MMAU benchmark. The suffixes (A) and (O) indicate Qwen2-Audio and Qwen2.5-Omni, respectively, as the base model. More details of the prompts used here are presented in Appendix \ref{sec:prompts_rl}.
Figure 2: Overview of the CAFE framework evaluation pipeline. First, the audio model generates a reasoning process based on the audio-QA input. Meanwhile, the audio is captioned by Gemini-3-Pro. Finally, Gemini-3-Pro uses the inputs, the reasoning process, and the captions to extract different audio events.
Figure 3: Reasoning accuracy (accuracy on the evaluated benchmarks) and CAFE's perception accuracy across reasoning token lengths. Reasoning accuracy aligns closely with perception accuracy. Specifically, in panels (a)-(d), both metrics decline as the number of reasoning tokens increases, which we term Audio Perception Decay. The x-axis represents the mean of token length intervals, and the y-axis indicates the accuracy.
Figure 4: An illustration of the MPAR$^2$ training pipeline. The top of the figure outlines the overall GRPO training setup, whereas the lower section provides a detailed view of the reward design. The cold-start policy model is derived from the Stage 1 SFT-trained model.
Figure 5: Accuracy of MPAR$^2$-7B vs. Base across token lengths.
...and 2 more figures
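
The accuracy-versus-token-length curves described in the Figure 3 and Figure 5 captions can be approximated with a simple bucketing routine. The following is a minimal sketch, not the authors' evaluation code: the function name `accuracy_by_token_length`, the interval boundaries, and the toy records are all hypothetical, and we assume that "the mean of token length intervals" means the mean reasoning length of the samples falling in each interval.

```python
from statistics import mean


def accuracy_by_token_length(records, edges):
    """Bucket (num_reasoning_tokens, is_correct) pairs into length intervals.

    `edges` is a sorted list of interval boundaries; interval i covers
    edges[i] <= n < edges[i + 1]. Returns one (mean_length, accuracy)
    point per non-empty interval. Samples outside all intervals are ignored.
    """
    buckets = [[] for _ in range(len(edges) - 1)]
    for n_tokens, correct in records:
        for i in range(len(edges) - 1):
            if edges[i] <= n_tokens < edges[i + 1]:
                buckets[i].append((n_tokens, correct))
                break

    points = []
    for bucket in buckets:
        if not bucket:
            continue  # skip empty intervals instead of reporting 0 accuracy
        x = mean(n for n, _ in bucket)                   # assumed x-axis value
        y = mean(1.0 if c else 0.0 for _, c in bucket)   # accuracy in interval
        points.append((x, y))
    return points


# Toy data for illustration only (not results from the paper).
toy_records = [(10, True), (20, False), (110, False), (130, False), (150, True)]
toy_points = accuracy_by_token_length(toy_records, edges=[0, 100, 200])
```

Running the same routine once with answer correctness and once with a perception-correctness label would give the two curves that the Figure 3 caption compares; a downward trend in both as the interval mean grows is what the paper terms Audio Perception Decay.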
