Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions

Marcel Gibier, Nolwenn Celton, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, Jean-François Bonastre

TL;DR

This work approaches Audio Question Answering (AQA) with a symbolic, calibrated pipeline rather than end-to-end audio reasoning: detected acoustic events are converted into a textual description that a language model reads in place of the audio. Per-class LR-based calibration, combined with an adjustment for class priors, produces reliable segment-level predictions, which are inserted into a prompt for a GRPO-fine-tuned LLM (Qwen2.5-7B-Instruct) that answers the question. The key contributions are the LR calibration with prior-aware posterior conversion and the use of Group Relative Policy Optimization (GRPO) for reinforcement-learning fine-tuning with a simple reward function. The system reaches 62.6% accuracy on the development set and outperforms baselines such as Gemini-2.0-Flash. Because the language model sees only an explicit, time-localized list of events, its answers can be traced back to the detections that support them, which improves the interpretability and reliability of AQA.
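The prior-aware posterior conversion mentioned above can be illustrated with Bayes' rule in log-odds form. This is a minimal sketch, not the authors' implementation: it assumes the calibrated per-class score can be read as a natural-log likelihood ratio, and the names `posterior_from_llr`, `llr`, and `prior` are hypothetical.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def posterior_from_llr(llr: float, prior: float) -> float:
    """Bayes' rule in log-odds form.

    posterior log-odds = log-likelihood ratio + prior log-odds.
    `llr` is assumed to be a calibrated natural-log likelihood ratio
    for one class; `prior` is the assumed class prior in (0, 1).
    """
    return sigmoid(llr + logit(prior))


if __name__ == "__main__":
    # A neutral score (llr = 0) leaves the prior unchanged.
    print(posterior_from_llr(0.0, 0.3))
    # The same evidence yields a lower posterior for a rarer class.
    print(posterior_from_llr(2.0, 0.5), posterior_from_llr(2.0, 0.01))
```

Separating the score from the prior in this way means the same calibrated output can be reused under a different assumed class prevalence without retraining the classifier.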

Abstract

In this report, we describe our submission to Track 5 of the DCASE 2025 Challenge for the task of Audio Question Answering (AQA). Our system leverages the self-supervised (SSL) backbone BEATs to extract frame-level audio features, which are then processed by a classification head to generate segment-level predictions of acoustic events, following the AudioSet ontology. These segment-level predictions are calibrated and then converted into event-level predictions. Finally, the event-level predictions are incorporated into a structured prompt, along with the question and candidate answers. This prompt is fed to a fine-tuned version of Qwen2.5-7B-Instruct, trained using the GRPO algorithm with a simple reward function. Our method achieves an accuracy of 62.6% on the development set, demonstrating the effectiveness of combining acoustic event reasoning with instruction-tuned large language models for AQA.
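The last two stages of the pipeline described in the abstract (segment-level probabilities to event-level predictions, then a structured prompt) can be sketched as follows. Everything specific here is an assumption made for illustration: the 0.5 threshold, the one-second segment duration, the prompt wording, and the exact-match reward are hypothetical and are not taken from the report, which only states that the reward function is simple.

```python
from typing import List, Sequence, Tuple

Event = Tuple[str, float, float]  # (label, start in seconds, end in seconds)


def decode_events(probs: Sequence[float], label: str,
                  seg_dur: float = 1.0, threshold: float = 0.5) -> List[Event]:
    """Merge consecutive segments with probability >= threshold into events.

    `seg_dur` and `threshold` are illustrative values, not the paper's.
    """
    events: List[Event] = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # an event begins at this segment
        elif p < threshold and start is not None:
            events.append((label, start * seg_dur, i * seg_dur))
            start = None
    if start is not None:  # event still active at the end of the clip
        events.append((label, start * seg_dur, len(probs) * seg_dur))
    return events


def build_prompt(events: Sequence[Event], question: str,
                 choices: Sequence[str]) -> str:
    """Assemble a text-only prompt; the layout is a hypothetical example."""
    lines = ["Detected sound events:"]
    for label, t0, t1 in sorted(events, key=lambda e: e[1]):
        lines.append(f"- {label}: {t0:.1f}s to {t1:.1f}s")
    lines.append(f"Question: {question}")
    lines.append("Choices:")
    for letter, choice in zip("ABCDEFGH", choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer with the letter of the correct choice.")
    return "\n".join(lines)


def exact_match_reward(completion: str, gold_letter: str) -> float:
    """Hypothetical simple reward: 1.0 if the reply starts with the gold letter."""
    return 1.0 if completion.strip().upper().startswith(gold_letter.upper()) else 0.0


if __name__ == "__main__":
    events = decode_events([0.1, 0.8, 0.9, 0.2, 0.7], "Dog")
    print(build_prompt(events, "Is a dog barking?", ["yes", "no"]))
```

In a GRPO setup, a scalar reward of this kind would be computed for each sampled completion in a group and compared against the other completions for the same prompt.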

Paper Structure

This paper contains 19 sections, 5 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Our audio question-answering system is based on two main steps: (i) we decode the class presence probabilities output by the sound event detection model to obtain a time-localized, textual sound description of the audio; (ii) this text, along with the question, is then provided to a language model, allowing it to answer without directly analyzing the audio.
  • Figure 2: Reliability curve for the class Male Speech, man speaking.
  • Figure 3: Class-wise CLLR comparison on AudioSet, before vs. after calibration.
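
The CLLR metric named in the Figure 3 caption is not defined on this page. The sketch below uses the standard definition of the log-likelihood-ratio cost and assumes the scores are natural-log likelihood ratios; the function name `cllr` and its arguments are hypothetical, and the paper's own evaluation code may differ.

```python
import math
from typing import Sequence


def _softplus(x: float) -> float:
    """Stable evaluation of log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cllr(target_llrs: Sequence[float], nontarget_llrs: Sequence[float]) -> float:
    """Log-likelihood-ratio cost, in bits.

    target_llrs:    scores for segments where the class is present.
    nontarget_llrs: scores for segments where the class is absent.
    A system that always outputs llr = 0 scores exactly 1.0; lower is better.
    """
    if not target_llrs or not nontarget_llrs:
        raise ValueError("both score lists must be non-empty")
    # Targets are penalised for low LLRs, non-targets for high LLRs.
    target_cost = sum(_softplus(-s) for s in target_llrs) / len(target_llrs)
    nontarget_cost = sum(_softplus(s) for s in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (target_cost + nontarget_cost) / math.log(2.0)


if __name__ == "__main__":
    print(cllr([0.0, 0.0], [0.0]))        # uninformative scores
    print(cllr([4.0, 6.0], [-5.0, -3.0]))  # well-separated scores
```

Computed per class before and after calibration, a drop in this value indicates that the calibrated scores carry more reliable probabilistic information, while a value above 1.0 indicates scores that are worse than uninformative.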