Everyone deserves their voice to be heard: Analyzing Predictive Gender Bias in ASR Models Applied to Dutch Speech Data

Rik Raes, Saskia Lensink, Mykola Pechenizkiy

TL;DR

The moral framework of Weerts et al. (2022) is used to assess quality of service harms and fairness, and to provide a nuanced discussion on the implications of these biases, particularly for automatic subtitling.

Abstract

Recent research has shown that state-of-the-art (SotA) Automatic Speech Recognition (ASR) systems, such as Whisper, often exhibit predictive biases that disproportionately affect various demographic groups. This study identifies performance disparities of Whisper models on Dutch speech data from the Common Voice dataset and the Dutch National Public Broadcasting organisation. We analyzed the word error rate (WER), character error rate (CER), and a BERT-based semantic similarity (BSS) across gender groups. We used the moral framework of Weerts et al. (2022) to assess quality of service harms and fairness, and to provide a nuanced discussion on the implications of these biases, particularly for automatic subtitling. Our findings reveal substantial disparities in WER among gender groups across all model sizes, with bias identified through statistical testing.

Paper Structure

This paper contains 15 sections, 4 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Flowchart of the evaluation methodology. A model makes predictions on a data set. Its performance is then evaluated by measuring the WER, CER, and BSS of each prediction, weighted by the number of words in the instance. Next, the weighted average of these scores is computed per speaker ID to remove data dependencies, and statistical testing is performed to determine whether predictive bias exists. Finally, the models are evaluated on fairness by applying the WER Parity equation (\ref{eq:mean_disparity_comb}) to the weighted average WER scores.
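
The first two stages of the flowchart in Figure 1 (per-instance WER, then a word-count-weighted average per speaker ID) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `(speaker_id, reference, hypothesis)` tuple layout, and the whitespace tokenisation are assumptions made for illustration. The statistical test and the WER Parity equation are not reproduced because they are not specified in this section.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of one instance (assumes a non-empty reference)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def speaker_weighted_wer(instances):
    """Average WER per speaker, weighting each instance by its word count.

    `instances` is a hypothetical iterable of
    (speaker_id, reference, hypothesis) tuples.
    """
    weighted_sum = defaultdict(float)
    total_words = defaultdict(int)
    for speaker_id, reference, hypothesis in instances:
        n_words = len(reference.split())
        weighted_sum[speaker_id] += wer(reference, hypothesis) * n_words
        total_words[speaker_id] += n_words
    return {s: weighted_sum[s] / total_words[s] for s in weighted_sum}


# Toy data, invented for illustration only.
instances = [
    ("spk1", "de kat zit op de mat", "de kat zit op mat"),
    ("spk1", "goede morgen", "goede morgen"),
    ("spk2", "het regent vandaag", "het regent niet vandaag"),
]
per_speaker = speaker_weighted_wer(instances)
```

Weighting each instance by its word count makes a speaker's score equal to total word errors divided by total reference words, so short utterances do not dominate the average. Aggregating to one score per speaker yields one observation per person, which is what the caption refers to as removing data dependencies before testing.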