Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech Integrated Large Language Models
Yi-Cheng Lin, Tzu-Quan Lin, Chih-Kai Yang, Ke-Han Lu, Wei-Chih Chen, Chun-Yi Kuan, Hung-yi Lee
TL;DR
This study investigates semantic gender bias in Speech Integrated Large Language Models (SILLMs) across four tasks: speech-to-text translation (STT), spoken coreference resolution (SCR), spoken sentence continuation (SSC), and spoken question answering (SQA). It introduces a curated bias evaluation toolkit and a spoken bias dataset to quantify bias and to benchmark cascaded ASR-LLM systems and end-to-end SILLMs. The results reveal language-dependent bias patterns and show that the measured bias magnitude depends on the evaluation method, with instruction-finetuning reducing bias in STT and SCR but not consistently in SSC and SQA. These contributions provide a practical framework for auditing and debiasing SILLMs and underscore the importance of multilingual and cross-task fairness in real-world deployments.
Abstract
Speech Integrated Large Language Models (SILLMs) combine large language models with speech perception to perform diverse tasks, ranging from emotion recognition to speaker verification, demonstrating universal audio understanding capability. However, these models may amplify biases present in training data, potentially leading to biased access to information for marginalized groups. This work introduces a curated spoken bias evaluation toolkit and corresponding dataset. We evaluate gender bias in SILLMs across four semantic-related tasks: speech-to-text translation (STT), spoken coreference resolution (SCR), spoken sentence continuation (SSC), and spoken question answering (SQA). Our analysis reveals that bias levels are language-dependent and vary with the evaluation method. Our findings emphasize the necessity of employing multiple approaches to comprehensively assess biases in SILLMs, providing insights for developing fairer SILLM systems.
