Unveiling Biases while Embracing Sustainability: Assessing the Dual Challenges of Automatic Speech Recognition Systems

Ajinkya Kulkarni, Atharva Kulkarni, Miguel Couceiro, Isabel Trancoso

TL;DR

This work investigates dual challenges in automatic speech recognition: biases across gender, age, and accents, and the environmental footprint of large ASR systems. By evaluating MMS and Whisper on bias-focused datasets Artie-Bias and CCv2, and measuring inference-time energy use and carbon emissions with three tracking tools across multiple GPUs, the study provides a comprehensive view of fairness and sustainability in real-world ASR. Findings show Whisper often outperforms MMS on read speech for bias metrics but can underperform on spontaneous speech, while MMS generally offers better sustainability; larger Whisper variants may underperform the medium size in some cases. The results highlight the importance of multi-metric benchmarking, the role of language adapters, and hardware characteristics in shaping both fairness and ecological impact, informing responsible deployment decisions in diverse linguistic contexts.

Abstract

In this paper, we present a bias- and sustainability-focused investigation of Automatic Speech Recognition (ASR) systems, namely Whisper and Massively Multilingual Speech (MMS), which have achieved state-of-the-art (SOTA) performance. Despite their improved performance in controlled settings, there remains a critical gap in understanding their efficacy and equity in real-world scenarios. We analyze ASR biases w.r.t. gender, accent, and age group, as well as their effect on downstream tasks. In addition, we examine the environmental impact of ASR systems, scrutinizing the effect of large acoustic models on carbon emissions and energy consumption. We also provide insights into our empirical analyses, offering a valuable contribution to the claims surrounding bias and sustainability in ASR systems.

Paper Structure

This paper contains 15 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Speech utterance distribution across gender, accent, and age for the Artie-Bias and CCv2 datasets.
  • Figure 2: Bar plots depicting Whisper and MMS ASR performance across gender, accent, and age. Whisper ASR variants are abbreviated as Whisper-Medium (W-M), Whisper-Large (W-L), Whisper-Large-V2 (W-L-V2), and Whisper-Large-V3 (W-L-V3).
  • Figure 3: Bar plots depicting carbon emissions (first row) and energy consumption (second row) of MMS and Whisper variants (W-M, W-L, W-L-V2, W-L-V3) obtained by carbontracker, codecarbon and eco2ai, as described in the sustainability subsection. The bar clusters in each bar plot correspond to the 4 NVIDIA GPUs, namely RTX-5000-16GB, RTX-A5000-24GB, A100-48GB, and A6000-48GB.
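
A per-group comparison like the one summarized in Figure 2 can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. It assumes word error rate (WER) is the performance measure, which the text above does not state, and the function names (`edit_distance`, `wer_by_group`, `wer_gap`) and the sample utterances are hypothetical.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """Aggregate WER per group.

    samples: iterable of (group, reference, hypothesis) tuples, where
    group is a demographic label (assumed here: gender, accent, or age band).
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, ref, hyp in samples:
        ref_tokens = ref.lower().split()
        hyp_tokens = hyp.lower().split()
        errors[group] += edit_distance(ref_tokens, hyp_tokens)
        words[group] += len(ref_tokens)
    # Skip groups with no reference words to avoid division by zero.
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


def wer_gap(group_wer):
    """Spread between the worst and best group, a simple disparity measure."""
    return max(group_wer.values()) - min(group_wer.values())


# Hypothetical transcripts, for illustration only.
samples = [
    ("female", "the cat sat", "the cat sat"),
    ("male", "the cat sat down", "the cat sat"),
]
group_wer = wer_by_group(samples)
print(group_wer, wer_gap(group_wer))
```

Errors and reference words are pooled within each group before dividing, so long utterances weigh more than short ones. Averaging per-utterance WER instead is an equally plausible choice and would give different numbers.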