Gender Bias in Instruction-Guided Speech Synthesis Models
Chun-Yi Kuan, Hung-yi Lee
TL;DR
This work investigates gender bias in instruction-guided text-to-speech (TTS) models when prompted with occupation-based style cues. It designs a prompt-and-content pipeline covering 109 occupations and four Parler-TTS models (Large v1, Mini v1, Mini v0.1, Mini Expresso), and applies automatic gender, emotion, and speaking-rate analyses alongside control baselines. The study finds persistent gender associations across occupations: stereotypically masculine occupations skew toward male voices and stereotypically feminine ones toward female voices, with the degree of skew varying by model size. Inference-time prompting as a mitigation shows limited and inconsistent effectiveness; it occasionally reverses the bias but does not offer a robust solution. These findings highlight the need for principled bias mitigation in style-controlled TTS and inform researchers and practitioners about the potential societal implications of occupation-related prompts.
Abstract
Recent advancements in controllable expressive speech synthesis, especially in text-to-speech (TTS) models, have allowed for the generation of speech with specific styles guided by textual descriptions, known as style prompts. While this development enhances the flexibility and naturalness of synthesized speech, there remains a significant gap in understanding how these models handle vague or abstract style prompts. This study investigates the potential gender bias in how models interpret occupation-related prompts, specifically examining their responses to instructions like "Act like a nurse". We explore whether these models exhibit tendencies to amplify gender stereotypes when interpreting such prompts. Our experimental results reveal the model's tendency to exhibit gender bias for certain occupations. Moreover, models of different sizes show varying degrees of this bias across these occupations.
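One way to make the notion of occupation-related gender bias concrete is to compare how often a prompt yields male-classified speech against a neutral control prompt. The sketch below is illustrative only and is not the authors' evaluation code: the function names, the "male"/"female" label set, and the sample labels are all hypothetical, and it assumes a separate gender classifier has already labelled each synthesized utterance.

```python
from collections import Counter


def male_rate(labels):
    """Fraction of utterances labelled 'male' by an (assumed) gender classifier."""
    if not labels:
        raise ValueError("need at least one label")
    return Counter(labels)["male"] / len(labels)


def bias_shift(occupation_labels, control_labels):
    """Male rate under an occupation prompt minus the male rate under a control prompt.

    Positive values mean the occupation prompt skews male relative to the
    control; negative values mean it skews female.
    """
    return male_rate(occupation_labels) - male_rate(control_labels)


# Hypothetical classifier outputs, invented purely for illustration.
control = ["male", "female", "female", "male"]
nurse = ["female", "female", "female", "male"]

shift = bias_shift(nurse, control)
print(f"shift relative to control: {shift:+.2f}")
```

Measuring against a control prompt rather than against a fixed 50% target separates the effect of the occupation cue from any baseline gender skew the model already has.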
