Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations

Yihao Wu, Tianrui Wang, Yizhou Peng, Yi-Wen Chao, Xuyi Zhuang, Xinsheng Wang, Shunshun Yin, Ziyang Ma

TL;DR

This work tackles paralinguistic bias in spoken-dialogue LLMs by introducing the FairDialogue framework, which evaluates decision-making tasks with the Group Unfairness Score (GUS) and recommendation tasks with SNSR and SNSV. It builds a controlled dataset with balanced textual prompts and synthesized audio across age, gender, and accent, enabling rigorous bias analysis of both open-source and closed-source SDMs. The study finds that closed-source models generally exhibit lower bias, while open-source models show stronger age- and gender-related disparities, with recommendation tasks more sensitive to cross-group differences; multi-turn conversations can propagate or amplify these biases. The work provides a first systematic baseline for fair, audio-based interactive systems and releases both the dataset and evaluation code to support ongoing research on mitigating biases in SDMs.

Abstract

While biases in large language models (LLMs), such as stereotypes and cultural tendencies in outputs, have been examined and identified, their presence and characteristics in spoken dialogue models (SDMs) with audio input and output remain largely unexplored. Paralinguistic features, such as age, gender, and accent, can affect model outputs; when compounded by multi-turn conversations, these effects may exacerbate biases, with potential implications for fairness in decision-making and recommendation tasks. In this paper, we systematically evaluate biases in speech LLMs and study the impact of multi-turn dialogues with repeated negative feedback. Bias is measured using the Group Unfairness Score (GUS) for decisions and the similarity-based normalized statistics rate (SNSR) for recommendations, across open-source models such as Qwen2.5-Omni and GLM-4-Voice and closed-source APIs such as GPT-4o Audio and Gemini-2.5-Flash. Our analysis reveals that closed-source models generally exhibit lower bias, while open-source models are more sensitive to age and gender, and recommendation tasks tend to amplify cross-group disparities. We also find that biased decisions may persist in multi-turn conversations. This work provides the first systematic study of biases in end-to-end spoken dialogue models, offering insights towards fair and reliable audio-based interactive systems. To facilitate further research, we release the FairDialogue dataset and evaluation code.
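
The abstract names the recommendation metrics without defining them. As a rough illustration only, the sketch below assumes a range-and-spread formulation of the kind used in earlier recommendation-fairness work: each group's recommendation list is compared with a neutral reference list, and bias is summarized by how much the per-group similarities differ. The Jaccard similarity, the function names, and the toy data are all hypothetical; this is not the paper's evaluation code and may differ from its exact definitions.

```python
from statistics import pstdev


def jaccard(a, b):
    """Jaccard similarity between two recommendation lists, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty lists are considered identical
    return len(sa & sb) / len(sa | sb)


def similarity_spread(neutral, by_group):
    """Summarize cross-group disparity in similarity to a neutral list.

    Assumption (not from the source): disparity is reported as the range
    (max - min) and the population standard deviation of the per-group
    similarities.
    """
    sims = {group: jaccard(neutral, recs) for group, recs in by_group.items()}
    values = list(sims.values())
    return {
        "per_group": sims,
        "range": max(values) - min(values),
        "std": pstdev(values),
    }


# Toy data: the same request, answered for two hypothetical speaker groups.
neutral = ["a", "b", "c", "d"]
by_group = {
    "young": ["a", "b", "c", "d"],
    "elderly": ["a", "b", "e", "f"],
}
result = similarity_spread(neutral, by_group)
print(result)
```

Under this assumed formulation, a model whose recommendations do not depend on the speaker's attributes gives every group the same similarity, so both the range and the standard deviation are zero; larger values indicate stronger cross-group disparity.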

Paper Structure

This paper contains 11 sections, 4 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: A fairness evaluation example for audio dialogue LLMs in interview decision-making. We compare the outputs for the same utterances spoken with different paralinguistic attributes and examine whether multi-turn dialogues alter decisions. Ideally, decision outputs should remain consistent within each attribute category. The left side indicates the paralinguistic attribute categories, and the right side depicts the corresponding real-world scenarios.