AURA Score: A Metric For Holistic Audio Question Answering Evaluation

Satvik Dixit, Soham Deshmukh, Bhiksha Raj

TL;DR

This work addresses the challenge of evaluating open-ended Audio Question Answering by introducing AQEval, a large human-annotated benchmark designed to systematically assess how well metrics align with human judgments. It critiques existing NLP and audio-captioning metrics for neglecting question context and reasoning, and proposes AURA, a novel metric combining an LLM-based correctness assessment with an audio-grounded entailment module. Across AQEval, AURA achieves state-of-the-art correlation with human judgments, outperforming traditional metrics and even the baseline LLM evaluator, particularly for longer and more nuanced answers. By releasing both AQEval and AURA, the authors provide a robust framework to spur the development of holistic evaluation methods for audio-language models with practical impact on metric design and model development.

Abstract

Audio Question Answering (AQA) is a key task for evaluating Audio-Language Models (ALMs), yet assessing open-ended responses remains challenging. Existing metrics used for AQA, such as BLEU, METEOR, and BERTScore, are mostly adapted from NLP and audio captioning; they rely on surface similarity and fail to account for question context, reasoning, and partial correctness. To address this gap in the literature, we make three contributions in this work. First, we introduce AQEval to enable systematic benchmarking of AQA metrics. It is the first benchmark of its kind, consisting of 10k model responses annotated by multiple humans for their correctness and relevance. Second, we conduct a comprehensive analysis of existing AQA metrics on AQEval, highlighting their weak correlation with human judgment, especially for longer answers. Third, we propose a new metric, the AURA score, to better evaluate open-ended model responses. On AQEval, AURA achieves state-of-the-art correlation with human ratings, significantly outperforming all baselines. Through this work, we aim to highlight the limitations of current AQA evaluation methods and motivate better metrics. We release both the AQEval benchmark and the AURA metric to support future research in holistic AQA evaluation.
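
To make "correlation with human judgment" concrete, the sketch below scores a metric by rank-correlating its outputs against human correctness labels. It is only an illustration: the abstract does not state which correlation statistic is used, so Spearman correlation is an assumption here, and the function names and the toy data are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(metric_scores, human_labels):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(metric_scores), average_ranks(human_labels))


# Toy data (hypothetical): metric scores and binary human judgments (1 = correct).
metric_scores = [0.1, 0.2, 0.8, 0.9]
human_labels = [0, 0, 1, 1]
rho = spearman(metric_scores, human_labels)
```

A metric whose ranking of responses agrees with the human labels yields a value close to 1; a metric that ranks responses in the opposite order yields a value close to -1.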

Paper Structure

This paper contains 15 sections, 2 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Examples of metrics failing at AQA evaluation. As the response gets more complex, traditional metrics struggle.
  • Figure 2: Method overview. The AURA metric evaluates a response given the audio, question and reference. The LLM reformulates the question and response into a hypothesis and the entailment model determines if the audio entails the hypothesis. This Audio Entailment score is combined with an LLM-based correctness score as a weighted sum followed by normalization (shown at N) to get the AURA score.
  • Figure 3: Effect of adding the audio entailment term.
  • Figure 4: MTurk interface used for our human annotation. Annotators listened to an audio clip, reviewed the question and reference answer, and provided a binary correctness judgment (Correct/Incorrect) on the candidate response.
  • Figure 5: Examples of AQEval question categories. The figure showcases different types of questions and their reference answers from the AQEval dataset.
  • ...and 3 more figures
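
The combination step described in the Figure 2 caption (an audio entailment score added to an LLM-based correctness score as a weighted sum, then normalized) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name, the assumption that both scores lie in [0, 1], the default weight of 0.5, and the choice to normalize by dividing by the maximum attainable sum are all hypothetical, since the caption does not specify them.

```python
def aura_score(llm_score, entailment_score, weight=0.5):
    """Combine two scores as a weighted sum, then normalize to [0, 1].

    Assumptions (not taken from the paper):
      - llm_score and entailment_score are both in [0, 1];
      - weight scales the entailment term and is non-negative;
      - normalization divides by the largest possible weighted sum.
    """
    for name, value in (("llm_score", llm_score),
                        ("entailment_score", entailment_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    if weight < 0:
        raise ValueError("weight must be non-negative")

    raw = llm_score + weight * entailment_score  # weighted sum
    return raw / (1.0 + weight)                  # normalization step


# Hypothetical example: the LLM judges the response correct,
# but the audio does not entail the reformulated hypothesis.
score = aura_score(llm_score=1.0, entailment_score=0.0)
```

Under these assumptions, a response that the LLM judge accepts but the audio does not support receives a reduced score rather than full credit, which is the role the caption attributes to the entailment term.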