AQA-TTRL: Self-Adaptation in Audio Question Answering with Test-Time Reinforcement Learning

Haoyu Zhang, Jiaxian Guo, Yusuke Iwasawa, Yutaka Matsuo

TL;DR

Deployed Large Audio Language Models (LALMs) are static: they do not improve as they encounter new audio data. The paper proposes AQA-TTRL, a test-time self-adaptation framework for audio question answering that relies solely on unlabeled test data. It forms a closed loop in which pseudo-labels are produced by majority voting over the model's own responses and the model is then updated with GRPO. A confidence-weighted advantage reduces the influence of noisy pseudo-labels, and a multiple-attempt sampling strategy mitigates advantage collapse and stabilizes training. On MMAU, MMAR, and MMSU, the average improvement is 4.42% for the Qwen2.5-Omni 7B model and 11.04% for the 3B model, and the adapted 3B model outperforms direct inference with the unadapted 7B model. The results indicate that test-time self-evolution is viable for complex audio understanding tasks and offers a path toward more robust, data-efficient deployment of LALMs.

Abstract

Large Audio Language Models (LALMs) demonstrate impressive general audio understanding, but once deployed, they are static and fail to improve with new real-world audio data. As traditional supervised fine-tuning is costly, we introduce a novel framework for test-time audio understanding, AQA-TTRL, where an LALM evolves on-the-fly using only unlabeled test data. It first generates pseudo-labels from its own predictions via majority voting, then optimizes the model via reinforcement learning. To handle the inherent noise in these self-generated labels, we introduce a confidence-based weighting method to adjust training signals. Furthermore, a multiple-attempt sampling operation mitigates advantage collapse and stabilizes training. On the MMAU (test-mini/test), MMAR, and MMSU benchmarks, AQA-TTRL achieves significant average improvements of 4.42% for the Qwen2.5-Omni 7B model and 11.04% for the 3B model. Notably, the adapted 3B model consistently outperforms the direct inference of the unadapted 7B model, highlighting the effectiveness of previously unexplored test-time adaptations in audio understanding.
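
The pseudo-labeling and weighting steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the formulas, so the code assumes that confidence is the vote share of the majority answer, that the reward is 1 when a response matches the pseudo-label and 0 otherwise, and that the confidence simply scales a GRPO-style group-normalized advantage. The function names `majority_vote` and `weighted_advantages` are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote(answers):
    """Return (pseudo_label, confidence).

    Assumption: confidence is the fraction of sampled answers that agree
    with the most frequent answer.
    """
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def weighted_advantages(answers, pseudo_label, confidence, eps=1e-6):
    """Group-normalized advantages scaled by pseudo-label confidence.

    Assumption: binary reward against the pseudo-label, population standard
    deviation for normalization, and a plain multiplicative confidence weight.
    """
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [confidence * (r - mu) / (sigma + eps) for r in rewards]


# Eight sampled answers to one multiple-choice audio question.
answers = ["B", "B", "A", "B", "C", "B", "B", "A"]
label, conf = majority_vote(answers)
adv = weighted_advantages(answers, label, conf)

# If every response agrees, all rewards are equal and the normalized
# advantages vanish, which is the "advantage collapse" the abstract mentions.
collapsed = weighted_advantages(["A"] * 4, "A", 1.0)
```

In this sketch a low-confidence pseudo-label shrinks every advantage in its group, so a question on which the model's votes are split contributes a weaker update than one with a clear majority. The collapsed case shows why a group of identical responses carries no learning signal; how the paper's multiple-attempt sampling avoids that situation is not specified in the abstract and is not modeled here.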

Paper Structure

This paper contains 13 sections, 5 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Role of AQA-TTRL: When faced with unseen test data, AQA-TTRL enables an automatic adaptation pipeline by eliminating the need for manual annotation.
  • Figure 2: Overview of AQA-TTRL. The framework derives pseudo-labels and their confidence through majority voting, conducts multiple-attempt sampling for effective response generation, and weights the learning signal by that confidence.
  • Figure 3: Accuracy vs. Confidence of Pseudo-Labels
  • Figure 4: Case study: accuracy over training steps (MMSU). To preserve the label-free setting, we report performance at the fixed 500th step. However, better performance may occur at intermediate steps. Given that MMSU contains 5,000 samples and the global training batch size is 8, 500 steps cover at most 4,000 samples, which suggests that the model can reach strong performance before fully traversing the dataset. This highlights an avenue for future work: designing label-free model selection strategies that enhance training effectiveness while reducing computational cost.
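
The coverage argument in the Figure 4 caption can be checked with a few lines of arithmetic. The sketch below assumes that each of the 8 items in a global batch is a distinct test question and that no question is revisited within the first 500 steps; the caption does not state either point, and the helper name `coverage` is hypothetical.

```python
def coverage(steps, batch_size, dataset_size):
    """Return (samples_seen, fraction_of_dataset) after a number of steps.

    Assumption: every step consumes `batch_size` previously unseen samples.
    """
    seen = min(steps * batch_size, dataset_size)
    return seen, seen / dataset_size


seen, fraction = coverage(steps=500, batch_size=8, dataset_size=5000)
print(f"{seen} of 5000 samples ({fraction:.0%})")  # prints "4000 of 5000 samples (80%)"
```

Under these assumptions the reported checkpoint has seen at most 80% of MMSU, and any stronger intermediate checkpoint has seen less.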