The Balancing Act: Unmasking and Alleviating ASR Biases in Portuguese

Ajinkya Kulkarni, Anna Tokareva, Rameez Qureshi, Miguel Couceiro

TL;DR

The paper examines biases in state-of-the-art multilingual ASR systems for Portuguese by comparing Whisper and MMS on the CCD V2 scripted dataset. It combines naive and SMOTE oversampling with WER/CER metrics and p-value significance tests for gender to quantify disparities across gender, age, skin tone, and geo-location. Results show that MMS generally performs more evenly across groups than Whisper and that SMOTE oversampling mitigates several biases, though some remain language- and region-dependent. The study identifies data distribution as a key driver of fairness in multilingual ASR and suggests extending the bias analysis to additional languages and datasets.
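
To make the naive oversampling idea concrete, here is a minimal sketch that duplicates randomly chosen samples from under-represented groups until every group matches the largest one. The record layout (`id`, `gender`, `wer`) and the helper name `naive_oversample` are illustrative assumptions, not taken from the paper. SMOTE differs in that it synthesizes new points by interpolating between neighbouring minority samples in a feature space; how the paper applies it is not shown here.

```python
import random
from collections import Counter


def naive_oversample(samples, group_of, seed=0):
    """Balance groups by resampling minority members with replacement.

    `samples` is any list of records and `group_of` maps a record to its
    group label. Both are hypothetical conventions used for illustration.
    """
    rng = random.Random(seed)

    # Bucket the records by group label.
    by_group = {}
    for sample in samples:
        by_group.setdefault(group_of(sample), []).append(sample)

    # Every group is grown to the size of the largest group.
    target = max(len(members) for members in by_group.values())

    balanced = []
    for members in by_group.values():
        balanced.extend(members)
        # Draw the missing records with replacement (k may be 0).
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Hypothetical per-utterance records with an imbalanced gender split.
utterances = [
    {"id": 1, "gender": "male", "wer": 0.10},
    {"id": 2, "gender": "male", "wer": 0.14},
    {"id": 3, "gender": "male", "wer": 0.09},
    {"id": 4, "gender": "female", "wer": 0.18},
]

balanced = naive_oversample(utterances, lambda u: u["gender"])
counts = Counter(u["gender"] for u in balanced)
print(counts)
```

Duplicating records leaves per-group means roughly unchanged while equalizing group sizes, which is why it mainly affects statistics pooled across groups.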

Abstract

In the field of spoken language understanding, systems like Whisper and Massively Multilingual Speech (MMS) have shown state-of-the-art performance. This study explores the Whisper and MMS systems in depth, focusing on biases in automatic speech recognition (ASR) for casual conversational speech in Portuguese. Our investigation covers several categories: gender, age, skin tone, and geo-location. Alongside traditional ASR evaluation metrics such as Word Error Rate (WER), we use p-value statistical significance testing for gender bias analysis. Furthermore, we examine the impact of data distribution and show empirically that oversampling techniques alleviate such stereotypical biases. This research is a pioneering effort to quantify biases in the Portuguese language context using MMS and Whisper, contributing to a better understanding of ASR systems' performance in multilingual settings.
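
As a rough illustration of the two measurements named in the abstract, the sketch below computes WER as word-level edit distance divided by reference length, then estimates a p-value for the difference in mean WER between two groups. The abstract does not say which significance test the authors use, so the permutation test here is an assumption chosen for simplicity. The function names and the example sentences are likewise hypothetical.

```python
import random
from statistics import fmean


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Assumed test for illustration only; not necessarily the paper's.
    """
    rng = random.Random(seed)
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# One deleted word out of five reference words.
wer = word_error_rate("o gato dorme no sofa", "o gato dorme sofa")
print(wer)
```

A small p-value from `permutation_p_value(male_wers, female_wers)` would suggest that the gap in mean WER between the two groups is unlikely to arise from random group assignment alone.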

Paper Structure (16 sections, 1 equation, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Bar plots depicting Whisper and ASR performance across the Fitzpatrick skin-tone scale, ranging from type-I to type-VI, for both male and female genders, with results for initial samples, naïve sampling, and SMOTE sampling.
  • Figure 2: Bar plots showing the WER of multilingual ASR systems (Whisper ASR variants and MMS) for male and female genders under three sampling methods: initial, naïve, and SMOTE. Whisper ASR variants are abbreviated as Whisper-Large (W-L), Whisper-Large-V2 (W-L-V2), and Whisper-Medium (W-M).
  • Figure 3: Bar plots illustrating the distribution of mean WER for Fitzpatrick skin-tone scales across initial, naïve, and SMOTE sampling methods.
  • Figure 4: Bar plots illustrating the distribution of WER for age groups, categorized into seven subsets (18-24, 25-30, 31-36, 37-42, 42-50, 51-60, 61+), across initial, naïve, and SMOTE sampling methods.
  • Figure 5: Visualization of the mean WER distribution in each Brazilian state. The state abbreviations are as follows: RN - Rio Grande do Norte, SP - Sao Paulo, RS - Rio Grande do Sul, GO - Goias, MT - Mato Grosso, PR - Parana, RJ - Rio de Janeiro, MG - Minas Gerais, PI - Piaui, PE - Pernambuco, MA - Maranhao.