The Binding Effect: Analyzing How Multi-Dimensional Cues Form Gender Bias in Instruction TTS

Kuan-Yu Chen, Yi-Cheng Lin, Po-Chung Hsieh, Huang-Cheng Chou, Chih-Fan Hsu, Jeng-Lin Li, Hung-yi Lee, Jian-Jiun Ding

Abstract

Current bias evaluations in Instruction Text-to-Speech (ITTS) often rely on univariate testing, overlooking the compositional structure of social cues. In this work, we investigate gender bias by modeling prompts as combinations of Social Status, Career stereotypes, and Persona descriptors. Analyzing open-source ITTS models, we uncover systematic interaction effects where social dimensions modulate one another, creating complex bias patterns missed by univariate baselines. Crucially, our findings indicate that these biases extend beyond surface-level artifacts, demonstrating strong associations with the semantic priors of pre-trained text encoders and the skewed distributions inherent in training data. We further demonstrate that generic diversity prompting is insufficient to override these entrenched patterns, underscoring the need for compositional analysis to diagnose latent risks in generative speech.

Paper Structure

This paper contains 16 sections, 7 equations, 1 figure, and 6 tables.

Figures (1)

  • Figure 1: Overall framework. The two-stage evaluation is illustrated with the PromptTTS++ model. Stage 1 establishes univariate gender priors: an isolated descriptor such as nurse triggers a strong female bias ($P(\mathbf{x})=0.99$). In Stage 2, combining nurse with attributes such as high-status and reckless produces a binding effect that shifts the perceived gender of the originally female-leaning descriptor toward male ($P(\mathbf{x})=0.17$).
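
The two-stage setup in Figure 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The prompt template, the helper names `build_prompts` and `binding_shift`, and the extra descriptors (`low-status`, `engineer`, `gentle`) are hypothetical. The sketch also assumes that $P(\mathbf{x})$ is the probability that the generated voice is perceived as female, which is how the caption appears to use it. Only the values 0.99 and 0.17 come from the figure.

```python
from itertools import product

# Descriptor pools. "high-status", "nurse" and "reckless" appear in Figure 1;
# the remaining entries are hypothetical placeholders for illustration.
STATUSES = ["high-status", "low-status"]
CAREERS = ["nurse", "engineer"]
PERSONAS = ["reckless", "gentle"]


def build_prompts(statuses, careers, personas):
    """Stage 2: enumerate every (status, career, persona) combination.

    The sentence template is an assumption, not the paper's wording.
    """
    return {
        (s, c, p): f"A {s}, {p} {c} is speaking."
        for s, c, p in product(statuses, careers, personas)
    }


def binding_shift(univariate_p, combined_p):
    """Change in P(x) when a descriptor is combined with other cues.

    Negative values mean the combination moves the voice toward male,
    under the assumption that P(x) is the probability of a female voice.
    """
    return combined_p - univariate_p


prompts = build_prompts(STATUSES, CAREERS, PERSONAS)

# Values reported in Figure 1 for PromptTTS++:
# nurse alone -> 0.99; high-status + reckless + nurse -> 0.17.
shift = binding_shift(univariate_p=0.99, combined_p=0.17)
print(len(prompts), "prompts; shift for the Figure 1 example:", round(shift, 2))
```

A univariate test would report only the Stage 1 value for nurse. The shift between the two stages is the quantity that a compositional analysis adds.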