Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models

Haolin He, Xingjian Du, Renhe Sun, Zheqi Dai, Yujia Xiao, Mingru Yang, Jiayi Zhou, Xiquan Li, Zhengxi Liu, Zining Liang, Chunyat Wu, Qianhua He, Tan Lee, Xie Chen, Wei-Long Zheng, Weiqiang Wang, Mark Plumbley, Jian Liu, Qiuqiang Kong

TL;DR

This work addresses the data and training challenges in post-training Large Audio Language Models (LALMs) by introducing AudioMCQ, a large-scale audio multiple-choice question dataset with chain-of-thought (CoT) annotations. It identifies a prevalent zero audio-contribution phenomenon, in which models answer correctly from text alone, and proposes Audio-Contribution Filtering to partition data into weak and strong audio-contribution subsets. Two post-training paradigms, Weak-to-Strong and Mixed-to-Strong, allocate data between the supervised fine-tuning (SFT) and reinforcement learning (RL) stages; they achieve state-of-the-art results on several audio benchmarks, and the authors took first place in the DCASE 2025 Audio-Question-Answering challenge. The study contributes both a dataset and practical data-allocation strategies that improve audio comprehension in LALMs, with insights into the role of CoT reasoning and the distribution of audio-contributed content. It points future work on audio QA and post-training toward task-aligned data selection and robust evaluation.

Abstract

Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-training such as reinforcement learning (RL) has demonstrated promising results, multi-stage approaches such as supervised fine-tuning (SFT) followed by RL remain suboptimal. The allocation of data across multiple training stages to maximize LALM capabilities has not been fully explored, and large-scale, high-quality datasets for such research are also lacking. To address these problems, we first present AudioMCQ, a comprehensive audio multiple-choice question dataset comprising 571k samples with two kinds of chain-of-thought annotations. Second, we investigate the prevalent zero audio-contribution phenomenon in LALMs, where models derive correct answers solely from textual information without processing audio content. We propose Audio-Contribution Filtering to partition data into weak and strong audio-contribution subsets. Based on these insights, we develop two effective post-training paradigms: Weak-to-Strong (SFT on weak audio-contribution data followed by RL on strong audio-contribution data) and Mixed-to-Strong (SFT on mixed audio-contribution data followed by RL on strong audio-contribution data). We achieve first place in the DCASE 2025 Audio-Question-Answering challenge by using AudioMCQ. Additionally, leveraging our dataset with different training strategies, we achieve 78.2% on MMAU-test-mini, 75.6% on MMAU, 67.1% on MMAR, and 70.7% on MMSU, establishing new state-of-the-art performance.
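
The partitioning and stage-allocation idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `MCQSample`, `partition_by_audio_contribution`, `allocate_stages`, and the `min_correct` threshold are hypothetical, and the sketch assumes that (a) a sample counts as weak audio-contribution when enough audio-blind answerers get it right from text alone, and (b) "mixed" data means the weak and strong subsets combined.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MCQSample:
    """Hypothetical container for one multiple-choice item (audio omitted)."""
    question: str
    choices: tuple[str, ...]
    answer: str


# Assumption: an audio-blind answerer sees only the question and choices.
TextOnlyAnswerer = Callable[[str, Sequence[str]], str]


def partition_by_audio_contribution(
    samples: Sequence[MCQSample],
    answerers: Sequence[TextOnlyAnswerer],
    min_correct: int = 1,  # assumed threshold, not taken from the paper
) -> tuple[list[MCQSample], list[MCQSample]]:
    """Split samples into (weak, strong) audio-contribution subsets.

    A sample is treated as weak when at least `min_correct` audio-blind
    answerers pick the right choice, i.e. the text alone suffices.
    """
    weak: list[MCQSample] = []
    strong: list[MCQSample] = []
    for sample in samples:
        n_correct = sum(
            answer(sample.question, sample.choices) == sample.answer
            for answer in answerers
        )
        (weak if n_correct >= min_correct else strong).append(sample)
    return weak, strong


def allocate_stages(
    weak: Sequence[MCQSample],
    strong: Sequence[MCQSample],
    paradigm: str,
) -> dict[str, list[MCQSample]]:
    """Assign subsets to the SFT and RL stages for the two paradigms."""
    if paradigm == "weak-to-strong":
        return {"sft": list(weak), "rl": list(strong)}
    if paradigm == "mixed-to-strong":
        # Assumption: "mixed" is the union of the weak and strong subsets.
        return {"sft": list(weak) + list(strong), "rl": list(strong)}
    raise ValueError(f"unknown paradigm: {paradigm!r}")
```

In both paradigms the RL stage sees only strong audio-contribution samples; the paradigms differ only in which data the SFT stage receives.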

Paper Structure

This paper contains 34 sections, 9 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Overview of dataset construction. Detailed prompts are provided in the appendix on prompts. Information on the in-pipeline quality check is provided in the appendix on pipeline verification.
  • Figure 2: Randomly sampled questions from four distinct question types.
  • Figure 3: Distribution analysis of AudioMCQ dataset.
  • Figure 4: Performance comparison of three training approaches on MMAU-test-mini-4k for optimal checkpoint selection. Note that "Mixed-to-X (SFT)" indicates the shared SFT phase of Mixed-to-Mixed and Mixed-to-Strong approaches.
  • Figure 5: Performance comparison of three training approaches across MMAU-test-mini, MMAR, MMSU and their strong audio-contribution splits. Only the optimal checkpoints are displayed.