Hearing the Order: Investigating Selection Bias in Large Audio-Language Models
Yu-Xiang Lin, Chen-An Li, Sheng-Lun Wei, Po-Chun Chen, Hsin-Hsi Chen, Hung-yi Lee
TL;DR
This work investigates whether large audio-language models (LALMs) exhibit selection bias on multiple-choice questions (MCQs), where the order of answer options can drive predictions independently of their content. The authors move the correct answer to each option position in turn and apply permutation-based evaluation across six LALMs and three MCQ benchmarks (including spoken variants), quantifying bias with Δ accuracy, Relative Standard Deviation (RSD), and CKLD. They find pervasive bias across models and datasets, with accuracy fluctuations of up to about 24% and frequent changes in model rankings. Identifiers improve accuracy but do not reliably curb bias, and full permutation is the most reliable mitigation, at the cost of extra compute. The study underscores the need for robust permutation-based or alternative evaluation frameworks to ensure fair, reliable assessment of LALMs' reasoning in audio-language contexts, and motivates future methods to mitigate order-induced artifacts.
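The position-reassignment probe and two of the metrics named above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes Δ accuracy is the gap between the best and worst per-position accuracy and that RSD is the population standard deviation of per-position accuracy divided by its mean. The paper's exact definitions may differ, and the helper names `move_answer` and `position_bias` are hypothetical.

```python
from statistics import mean, pstdev


def move_answer(options, correct_idx, target_idx):
    """Return a copy of `options` with the correct option placed at `target_idx`.

    The remaining options keep their relative order.
    """
    reordered = [opt for i, opt in enumerate(options) if i != correct_idx]
    reordered.insert(target_idx, options[correct_idx])
    return reordered


def position_bias(acc_by_position):
    """Summarize how much accuracy depends on where the correct answer sits.

    `acc_by_position[k]` is the accuracy measured when the correct answer is
    always placed at position k. Both definitions below are assumptions.
    """
    delta_acc = max(acc_by_position) - min(acc_by_position)
    mu = mean(acc_by_position)
    # RSD (assumed): standard deviation relative to the mean accuracy.
    rsd = pstdev(acc_by_position) / mu if mu else float("nan")
    return delta_acc, rsd


if __name__ == "__main__":
    print(move_answer(["dog", "cat", "bird", "fish"], correct_idx=1, target_idx=3))
    # Made-up per-position accuracies, for illustration only.
    print(position_bias([0.70, 0.62, 0.58, 0.50]))
```

A model with no positional preference would produce nearly identical accuracy at every position, so both values would be close to zero.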
Abstract
Large audio-language models (LALMs) are often used in tasks that involve reasoning over ordered options. An open question is whether their predictions are influenced by the order of answer choices, which would indicate a form of selection bias and undermine their reliability. In this paper, we identify and analyze this problem in LALMs. Through extensive experiments on six LALMs across three widely used benchmarks and their spoken counterparts, we demonstrate that no model is immune to this bias. Shuffling the order of answer options can cause performance fluctuations of up to 24% and even change model rankings, raising concerns about the reliability of current evaluation practices. We also study permutation-based strategies and show that they can mitigate bias in most cases. Our work represents the first systematic investigation of this issue in LALMs, and we hope it raises awareness and motivates further research in this direction.
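One way to picture a permutation-based strategy is to query the model under every ordering of the options and aggregate its choices by majority vote. The sketch below assumes that aggregation rule, which the abstract does not specify; `permutation_vote` and the `predict` callable (which takes the options in the order shown and returns the index it picks) are hypothetical stand-ins for a real LALM call.

```python
from collections import Counter
from itertools import permutations


def permutation_vote(options, predict):
    """Return the original index of the option chosen most often across orderings.

    `predict(shown)` is a hypothetical model call: it receives the options in
    the order presented and returns the index of its pick within that list.
    Cost grows as n! model calls for n options (24 calls for four options).
    """
    votes = Counter()
    for perm in permutations(range(len(options))):
        shown = [options[i] for i in perm]
        picked = predict(shown)
        # Map the picked position back to the option's original index.
        votes[perm[picked]] += 1
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    options = ["dog", "cat", "bird", "fish"]

    def biased_model(shown):
        # Toy model: finds "cat" unless it is in the last slot, then picks slot 0.
        return shown.index("cat") if shown.index("cat") < 3 else 0

    print(permutation_vote(options, biased_model))
```

In this toy setup the positional failure affects only the orderings that place the correct option last, so the vote still recovers the content-based answer. The factorial number of model calls is the extra compute this kind of mitigation requires.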
