ORCA: Open-ended Response Correctness Assessment for Audio Question Answering

Šimon Sedláček, Sara Barahona, Bolaji Yusuf, Laura Herrera-Alarcón, Santosh Kesiraju, Cecilia Bolaños, Alicia Lozano-Diez, Sathvik Udupa, Fernando López, Allison Ferner, Ramani Duraiswami, Jan Černocký

TL;DR

Evaluating open-ended audio QA is hampered by semantic ambiguity and subjective judgments. ORCA is a Beta-distribution-based evaluator that models both mean correctness and annotator uncertainty, trained on data from a three-stage annotation framework. On MMAUv05.15.25 and MMAR, ORCA achieves up to 0.91 Spearman correlation with mean human judgments and matches or outperforms LLM-judge baselines, while providing explicit uncertainty estimates from a single forward pass. The models, code, and curated annotations are released as open resources for scalable, reproducible audio QA evaluation.

Abstract

Evaluating open-ended responses from large audio language models (LALMs) is challenging because human annotators often genuinely disagree on answer correctness due to multiple valid interpretations, partial correctness, and subjective judgment. Traditional metrics reporting only mean scores fail to capture this uncertainty. We present ORCA (Open-ended Response Correctness Assessment), a framework that models the variability in human judgments using Beta distributions to predict both expected correctness and uncertainty. Our three-stage annotation framework combines human judgment with structured feedback and iterative refinement to simultaneously curate training data and improve benchmark quality. We collected 11,721 annotations across 3,580 question-answer pairs from 15 LALMs on two audio QA benchmarks, achieving inter-annotator agreement of 0.82 (Krippendorff's alpha). ORCA achieves 0.91 Spearman correlation with mean human judgments, matching or outperforming LLM-judge baselines while providing uncertainty estimates and requiring significantly less compute. We release our models, code, and curated dataset.
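
To make the Beta-distribution idea concrete, the following minimal sketch shows how one Beta distribution carries both an expected correctness (its mean) and an uncertainty (its variance). It is an illustration only, not ORCA's training procedure: the helper names `beta_moments` and `fit_beta_by_moments`, the method-of-moments fit, the 1-to-5 rating scale, and the example ratings are all assumptions introduced here.

```python
from statistics import fmean, pvariance


def beta_moments(alpha: float, beta: float) -> tuple[float, float]:
    """Return the mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, var


def fit_beta_by_moments(ratings, lo=1.0, hi=5.0, eps=1e-3):
    """Method-of-moments Beta fit (hypothetical helper, not from the paper).

    Ratings are rescaled from the assumed [lo, hi] scale to [0, 1].
    """
    scaled = [(r - lo) / (hi - lo) for r in ratings]
    # Keep the mean strictly inside (0, 1) so both parameters stay positive.
    m = min(max(fmean(scaled), eps), 1.0 - eps)
    # A valid Beta variance lies strictly between 0 and m * (1 - m).
    v_max = m * (1.0 - m)
    v = min(max(pvariance(scaled), eps * v_max), (1.0 - eps) * v_max)
    nu = v_max / v - 1.0  # nu = alpha + beta, the concentration
    return m * nu, (1.0 - m) * nu


# Two made-up rating sets with the same mean (3 on the assumed 1-5 scale).
consensus = [2, 3, 3, 4]      # annotators roughly agree
disagreement = [1, 2, 4, 5]   # annotators split

a_c, b_c = fit_beta_by_moments(consensus)
a_d, b_d = fit_beta_by_moments(disagreement)
mean_c, var_c = beta_moments(a_c, b_c)
mean_d, var_d = beta_moments(a_d, b_d)

print(f"consensus:    alpha={a_c:.2f} beta={b_c:.2f} mean={mean_c:.2f} var={var_c:.4f}")
print(f"disagreement: alpha={a_d:.2f} beta={b_d:.2f} mean={mean_d:.2f} var={var_d:.4f}")
```

Both fits have the same mean, so a metric that reports only the mean cannot tell them apart. The fitted variance is what separates agreement from disagreement.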

Paper Structure

This paper contains 44 sections, 5 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Two examples with similar mean ratings ($\mu \approx 3$) but different variances: low variance (top) indicates consensus; high variance (bottom) reveals disagreement.
  • Figure 2: Annotation framework pipeline. Data preparation (Stage 1) generates rationales via Gemini, transcripts via Whisper, and candidate answers from LALMs. Annotation (Stage 2) collects correctness scores and structured feedback from humans and LLM-judges. Iterative refinement (Stage 3) implements human-AI corrections based on feedback, with corrected data re-entering the pipeline (Stage 1b).
  • Figure 3: Comparison of LLM-judge (Gemini, offline LLM-judge fusion) to clamped ORCA OLMo-7B when trained and evaluated on the model hold-out sets.
  • Figure 4: Ablation study on ORCA and avg. LLM-judge inputs. We report MAE of the predicted scores with respect to the average human rating. Q, R, T denote the original question, the rationale, and the transcript, respectively.
  • Figure 5: Histogram of human ratings for answer correctness before and after correction (Stage 3).
  • ...and 3 more figures