Exploring Gender Disparities in Automatic Speech Recognition Technology

Hend ElGhazaly, Bahman Mirheidari, Nafise Sadat Moosavi, Heidi Christensen

TL;DR

This work examines gender bias in automatic speech recognition (ASR) beyond basic demographics by varying the gender composition of the training data while controlling for content. Using LibriSpeech and the Whisper small model, it shows that the relationship between gender ratio and word error rate (WER) is nonlinear, with peak fairness occurring at around 60–70% female representation rather than at a 50/50 balance. The study further finds that text readability and semantic similarity explain little of the WER variability, whereas the pitch distribution across speakers significantly influences performance, which points to pitch diversity as a practical bias-mitigation lever. The findings underscore the need for holistic dataset curation to reduce gender bias in ASR and motivate cross-language and cross-model validation of these patterns.

Abstract

This study investigates factors influencing Automatic Speech Recognition (ASR) systems' fairness and performance across genders, beyond the conventional examination of demographics. Using the LibriSpeech dataset and the Whisper small model, we analyze how performance varies across different gender representations in training data. Our findings suggest a complex interplay between the gender ratio in training data and ASR performance. Optimal fairness occurs at specific gender distributions rather than a simple 50-50 split. Furthermore, our findings suggest that factors like pitch variability can significantly affect ASR accuracy. This research contributes to a deeper understanding of biases in ASR systems, highlighting the importance of carefully curated training data in mitigating gender bias.
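
To make the fairness comparison concrete, the following is a minimal sketch of how a per-gender mean WER and the gap between groups could be computed. It is not the authors' evaluation code: the helper names (`word_error_rate`, `mean_wer_by_group`), the toy transcripts, the per-utterance averaging, and the use of an absolute difference as the fairness measure are all assumptions made for illustration.

```python
from statistics import mean


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def mean_wer_by_group(samples):
    """Average per-utterance WER for each group label (assumed aggregation)."""
    scores = {}
    for group, reference, hypothesis in samples:
        scores.setdefault(group, []).append(word_error_rate(reference, hypothesis))
    return {group: mean(values) for group, values in scores.items()}


# Hypothetical (group, reference, hypothesis) triples, not data from the paper.
samples = [
    ("female", "the cat sat on the mat", "the cat sat on the mat"),
    ("female", "she reads a book", "she read a book"),
    ("male", "the dog barked loudly", "the dog bark loudly"),
    ("male", "he walks to work", "he walks to the work"),
]

per_group = mean_wer_by_group(samples)
# Absolute difference in mean WER, used here as a simple fairness gap.
gap = abs(per_group["female"] - per_group["male"])
print(per_group, gap)
```

Under this sketch, a smaller gap indicates more even performance across the two groups. Repeating the computation for models fine-tuned on different gender ratios would give one gap value per ratio, which is the kind of curve the nonlinear relationship described above refers to.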

Paper Structure

This paper contains 11 sections, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Mean WERs of TestOther evaluation across the fine-tuned models. The dashed horizontal lines show the results from the original Whisper model.
  • Figure 2: Comparisons between training and test sets' text difficulty (left) and semantic similarity (right).
  • Figure 3: Illustrations of the pitch distributions in the training subsets (in red) and TestOther set (in blue).