AQA-TTRL: Self-Adaptation in Audio Question Answering with Test-Time Reinforcement Learning
Haoyu Zhang, Jiaxian Guo, Yusuke Iwasawa, Yutaka Matsuo
TL;DR
The paper addresses the problem that Large Audio Language Models (LALMs) remain static once deployed. It proposes AQA-TTRL, a test-time self-adaptation framework for audio question answering that relies solely on unlabeled test data. The framework forms a closed learning loop: pseudo-labels are produced by majority voting over the model's own predictions, and the model is then updated with GRPO (Group Relative Policy Optimization). A confidence-weighted advantage mitigates pseudo-label noise, and a multiple-attempt sampling strategy mitigates advantage collapse and stabilizes training. On MMAU, MMAR, and MMSU, the method yields average improvements of 4.42% for the Qwen2.5-Omni 7B model and 11.04% for the 3B model, and the adapted 3B model outperforms the unadapted 7B model under direct inference. This work demonstrates the viability of test-time self-evolution for complex audio understanding tasks, offering a path toward more robust, data-efficient deployment of LALMs.
Abstract
Large Audio Language Models (LALMs) demonstrate impressive general audio understanding, but once deployed, they are static and fail to improve with new real-world audio data. As traditional supervised fine-tuning is costly, we introduce a novel framework for test-time audio understanding, AQA-TTRL, where an LALM evolves on-the-fly using only unlabeled test data. It first generates pseudo-labels from its own predictions via majority voting, then optimizes the model via reinforcement learning. To handle the inherent noise in these self-generated labels, we introduce a confidence-based weighting method to adjust training signals. Furthermore, a multiple-attempt sampling operation mitigates advantage collapse and stabilizes training. On the MMAU (test-mini/test), MMAR, and MMSU benchmarks, AQA-TTRL achieves significant average improvements of 4.42% for the Qwen2.5-Omni 7B model and 11.04% for the 3B model. Notably, the adapted 3B model consistently outperforms the direct inference of the unadapted 7B model, highlighting the effectiveness of previously unexplored test-time adaptation in audio understanding.
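
The pseudo-labeling and advantage steps described above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`majority_vote`, `group_advantages`), the binary match reward, and the choice to scale the group-normalized advantage by the vote share are all assumptions made for illustration, since the abstract does not give the exact weighting formula. The sketch also shows why advantage collapse occurs: when every sampled answer agrees, all rewards are equal and every advantage is zero, so the update carries no signal.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote(answers):
    """Pick the most frequent sampled answer as the pseudo-label.

    Returns (pseudo_label, confidence), where confidence is the share of
    samples that voted for the pseudo-label (an assumed confidence measure).
    """
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def group_advantages(answers, pseudo_label, confidence, eps=1e-6):
    """Group-normalized advantages, scaled by confidence.

    Reward is 1.0 when a sample matches the pseudo-label and 0.0 otherwise
    (assumed reward). Advantages are (reward - group mean) / group std,
    as in GRPO-style normalization, then multiplied by the confidence
    (hypothetical weighting, not taken from the paper).
    """
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [confidence * (r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one multiple-choice audio question.
    sampled = ["B", "B", "A", "B"]
    label, conf = majority_vote(sampled)
    print(label, conf, group_advantages(sampled, label, conf))

    # Unanimous samples: all rewards are equal, so all advantages vanish.
    unanimous = ["C", "C", "C", "C"]
    label_u, conf_u = majority_vote(unanimous)
    print(label_u, conf_u, group_advantages(unanimous, label_u, conf_u))
```

In the unanimous case the group provides no learning signal, which is the situation the multiple-attempt sampling operation is described as mitigating; how the paper resamples in that case is not specified here and is not modeled in this sketch.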
