Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models
Yi-Cheng Lin, Wei-Chih Chen, Hung-yi Lee
TL;DR
This work addresses social bias in speech large language models (SLLMs) by introducing Spoken Stereoset, a corpus designed to probe speaker-based stereotypes about gender and age. It constructs spoken-context prompts via text-to-speech (TTS), pairs each prompt with three continuations (stereotypical, anti-stereotypical, irrelevant), and evaluates several SLLMs using three bias metrics (slifs, slms, slbs) plus response diversity measured with ROUGE-L. The experiments show that SALMONN models generally achieve high instruction-following and low bias, with some anti-stereotypical tendencies in the age domain, while some other models struggle to determine an answer. Text-only analyses suggest much of the bias stems from the speech encoder, highlighting the need for bias mitigation in multimodal speech-language systems. The work lays groundwork for fairer SLLMs and suggests expanding the dataset to cover more demographics and scenarios.
Abstract
Warning: This paper may contain texts with uncomfortable content.
Large Language Models (LLMs) have achieved remarkable performance in various tasks, including those involving multimodal data like speech. However, these models often exhibit biases due to the nature of their training data. Recently, more Speech Large Language Models (SLLMs) have emerged, underscoring the urgent need to address these biases. This study introduces Spoken Stereoset, a dataset specifically designed to evaluate social biases in SLLMs. By examining how different models respond to speech from diverse demographic groups, we aim to identify these biases. Our experiments reveal significant insights into the models' performance and bias levels. The findings indicate that while most models show minimal bias, some still exhibit slightly stereotypical or anti-stereotypical tendencies.
