Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs
Shree Harsha Bokkahalli Satish, Gustav Eje Henter, Éva Székely
TL;DR
This work questions whether gender-bias behaviours observed in SpeechLLMs on MCQA benchmarks generalise to other tasks and voice contexts. Using LoRA adapters to induce stereotypical, anti-stereotypical, or neutral responses, the authors test whether the induced behaviour transfers across MCQA benchmarks and to long-form generation, introducing new evaluation suites (SAGE, including SAGE-LF) alongside Spoken StereoSet. They find that MCQA-induced bias does not reliably transfer across benchmarks and often fails to predict long-form behaviour, with results varying by model and speaker gender; some models even decline to answer when tuned to be unbiased. The study highlights the need for holistic bias evaluation in SpeechLLMs and provides open-source tools for measuring how well behaviour transfers across tasks and real-world scenarios.
Abstract
Recent work on benchmarking bias and fairness in speech large language models (SpeechLLMs) has relied heavily on multiple-choice question answering (MCQA) formats. The model is tasked with choosing between stereotypical, anti-stereotypical, or neutral/irrelevant answers given an input speech prompt and an optional text prompt. Such MCQA benchmarks implicitly assume that model behaviour is consistent across other MCQA tasks, across voices, and across other task formats such as more realistic, long-form evaluations. In this paper, we probe that assumption. We fine-tune three SpeechLLMs using LoRA adapters to induce specific MCQA behaviours: a preference for stereotypical, anti-stereotypical, or neutral/uncertain answers. We then evaluate whether these behaviours generalise to another, distinct MCQA benchmark and, more critically, to long-form, creative generation tasks. Our results show that performance on an MCQA bias benchmark fails to reliably predict performance on other MCQA benchmarks and, more importantly, on long-form tasks. We conclude that current MCQA bias benchmarks show limited evidence of cross-task generalisation in the speech domain, and we propose an evaluation suite for measuring behaviour transferability in future models and benchmarks.
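To make the notion of behaviour transfer concrete, the following is a minimal sketch of how one might compare answer-category rates between a source MCQA benchmark (the one used for fine-tuning) and a target benchmark. It is an illustration only, not the paper's metric: the function names `choice_rates` and `transfer_gap`, the category labels, and the simple rate-difference measure are all assumptions introduced here.

```python
from collections import Counter

# Assumed category labels; a real benchmark may name them differently.
CATEGORIES = ("stereotypical", "anti-stereotypical", "neutral")


def choice_rates(choices):
    """Return the fraction of answers falling in each category."""
    if not choices:
        raise ValueError("choices must be non-empty")
    counts = Counter(choices)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = len(choices)
    return {c: counts[c] / total for c in CATEGORIES}


def transfer_gap(source_choices, target_choices, induced):
    """Hypothetical transfer measure: how much the rate of the induced
    category drops when moving from the source to the target benchmark.
    A gap near 0 suggests the behaviour carried over; a large positive
    gap suggests it did not."""
    source = choice_rates(source_choices)[induced]
    target = choice_rates(target_choices)[induced]
    return source - target


if __name__ == "__main__":
    # Toy data: a model tuned towards neutral answers on the source benchmark.
    source = ["neutral"] * 9 + ["stereotypical"]
    target = ["neutral"] * 5 + ["stereotypical"] * 5
    print(choice_rates(target))
    print(transfer_gap(source, target, induced="neutral"))
```

Under this assumed measure, a model whose induced preference holds on the source benchmark but weakens on the target benchmark yields a clearly positive gap, which is the kind of pattern the abstract describes as a failure to generalise.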
