Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs

Shree Harsha Bokkahalli Satish, Gustav Eje Henter, Éva Székely

TL;DR

This work questions whether gender-bias behaviours observed in SpeechLLMs on MCQA benchmarks generalise to other tasks and voice contexts. Using LoRA adapters to induce stereotypical, anti-stereotypical, or neutral responses, the authors test transfer across MCQA benchmarks and to long-form generation, introducing new evaluation suites (SAGE, including SAGE-LF) alongside Spoken StereoSet. They find that MCQA-induced bias does not reliably transfer across benchmarks and often fails to predict long-form behaviour, with results varying by model and speaker gender; some models even decline to answer when tuned to be unbiased. The study highlights the need for holistic bias evaluation in SpeechLLMs and provides open-source tools to measure behaviour transferability across tasks and real-world scenarios.

Abstract

Recent work on benchmarking bias and fairness in speech large language models (SpeechLLMs) has relied heavily on multiple-choice question answering (MCQA) formats. The model is asked to choose between stereotypical, anti-stereotypical, or neutral/irrelevant answers given an input speech prompt and an optional text prompt. Such MCQA benchmarks implicitly assume that model performance is consistent across other MCQA tasks, voices, and other task formats such as more realistic, long-form evaluations. In this paper, we probe that assumption. We fine-tune three SpeechLLMs using LoRA adapters to induce specific MCQA behaviours: preference for stereotypical, anti-stereotypical, or neutral/uncertain answers. We then evaluate whether these behaviours generalise to another, distinct MCQA benchmark, and, more critically, to long-form, creative generation tasks. Our results show that performance on MCQA bias benchmarks fails to reliably predict performance on other MCQA benchmarks and, more importantly, on long-form tasks. We conclude that current MCQA bias benchmarks show limited evidence of cross-task generalisation in the speech domain, and we propose an evaluation suite for measuring behaviour transferability in future models and benchmarks.
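
To make the MCQA setup concrete, the sketch below tallies how often a model picks each of the three answer types and how those rates shift after fine-tuning. It is an illustration only, not the authors' evaluation code: the label strings and the function names `choice_rates` and `rate_shift` are hypothetical, and the example answers are made up.

```python
from collections import Counter

# Hypothetical label names for the three MCQA answer types described above.
LABELS = ("stereotypical", "anti-stereotypical", "neutral")


def choice_rates(choices):
    """Return the fraction of answers falling into each answer type."""
    if not choices:
        raise ValueError("no answers to score")
    counts = Counter(choices)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(choices)
    return {label: counts[label] / total for label in LABELS}


def rate_shift(before, after):
    """Per-label change in selection rate (after minus before)."""
    return {label: after[label] - before[label] for label in LABELS}


if __name__ == "__main__":
    # Made-up answers for a baseline and a fine-tuned model on four items.
    baseline = ["stereotypical", "neutral", "neutral", "anti-stereotypical"]
    tuned = ["anti-stereotypical"] * 3 + ["neutral"]
    print(choice_rates(baseline))
    print(rate_shift(choice_rates(baseline), choice_rates(tuned)))
```

Comparing `rate_shift` on the benchmark used for fine-tuning with the same quantity on a second benchmark is one simple way to ask whether an induced preference carried over.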

Paper Structure

This paper contains 6 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Example of the lack of behavioural transfer from MCQA benchmarks to long-form outputs in SpeechLLMs
  • Figure 2: Long-form scores in selected dimensions for the baseline and Anti-stereotypical LoRA rank 8 fine-tuned models with 95% bootstrapped CI in brackets. Thicker borders indicate a significant difference over the corresponding baseline. Expected transfer patterns from MCQA to long-form would manifest as reduced emotional validation and increased STEM/leadership/achievement scores for women, contrasted with higher emotional validation and reduced STEM/leadership/achievement scores for men in the anti-stereotypical fine-tuned models.
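
The Figure 2 caption reports 95% bootstrapped confidence intervals and marks significant differences from the baseline. The sketch below shows one common way such quantities can be computed: a percentile bootstrap of the mean, with a difference treated as significant when the bootstrap interval of the mean difference excludes zero. This is an assumption for illustration; the source does not state the exact resampling scheme or significance criterion, and the function names and scores are hypothetical.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean; returns (mean, low, high)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), low, high


def difference_ci(baseline, tuned, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(tuned) - mean(baseline)."""
    rng = random.Random(seed)
    diffs = sorted(
        statistics.fmean(rng.choices(tuned, k=len(tuned)))
        - statistics.fmean(rng.choices(baseline, k=len(baseline)))
        for _ in range(n_resamples)
    )
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


def is_significant(baseline, tuned, **kwargs):
    """Assumed criterion: the difference CI does not contain zero."""
    low, high = difference_ci(baseline, tuned, **kwargs)
    return low > 0 or high < 0


if __name__ == "__main__":
    # Made-up long-form scores for one dimension.
    baseline = [1.0, 1.0, 2.0, 1.0, 2.0]
    tuned = [4.0, 5.0, 4.0, 5.0, 5.0]
    print(bootstrap_ci(baseline))
    print(is_significant(baseline, tuned))
```

A fixed `seed` keeps the intervals reproducible across runs; with few samples per condition, the percentile interval can be narrow or unstable, so it is best read alongside the raw scores.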