DENOASR: Debiasing ASRs through Selective Denoising

Anand Kumar Rai, Siddharth D Jaiswal, Shubham Prakash, Bendi Pragnya Sree, Animesh Mukherjee

TL;DR

DENOASR is a novel selective denoising framework that reduces the disparity in word error rates between male and female speakers. The results suggest that selective denoising can be an elegant approach to mitigating biases in present-day ASR systems.

Abstract

Automatic Speech Recognition (ASR) systems have been examined and shown to exhibit biases toward particular groups of individuals, influenced by factors such as demographic traits, accents, and speech styles. Noise can disproportionately impact speakers with certain accents, dialects, or speaking styles, leading to biased error rates. In this work, we introduce a novel framework, DENOASR, a selective denoising technique that reduces the disparity in word error rates between two gender groups, male and female. We find that a combination of two popular speech denoising techniques, viz. DEMUCS and LE, can be used effectively to mitigate ASR disparity without compromising overall ASR performance. Experiments using two state-of-the-art open-source ASRs, OpenAI WHISPER and NVIDIA NEMO, on multiple benchmark datasets, including TIE, VOX-POPULI, TEDLIUM, and FLEURS, show a promising reduction in the average word error rate gap across the two gender groups. For a given dataset, denoising is applied selectively to speech samples whose speech intelligibility falls below a certain threshold, estimated using a small validation sample, thus reducing the need for large-scale human-written ground-truth transcripts. Our findings suggest that selective denoising can be an elegant approach to mitigating biases in present-day ASR systems.
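
The selective step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the per-sample intelligibility scores, and the `denoise` callable are hypothetical placeholders. In practice the callable would stand in for a denoising pipeline such as DEMUCS followed by LE, and the scores for an intelligibility estimate. The sketch shows three pieces: a word error rate, the gap between group means, and denoising applied only below a threshold.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

Audio = TypeVar("Audio")  # placeholder type for a speech sample


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] is the edit distance between the previous ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def wer_gap(wers_by_group: Dict[str, List[float]]) -> float:
    """Difference between the highest and lowest mean WER across groups."""
    means = [sum(values) / len(values) for values in wers_by_group.values()]
    return max(means) - min(means)


def selective_denoise(
    samples: Sequence[Audio],
    scores: Sequence[float],
    threshold: float,
    denoise: Callable[[Audio], Audio],
) -> List[Audio]:
    """Denoise only the samples whose intelligibility score is below threshold.

    Assumption: `scores` holds one intelligibility estimate per sample and
    `denoise` is whatever denoising pipeline the caller supplies.
    """
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    return [
        denoise(sample) if score < threshold else sample
        for sample, score in zip(samples, scores)
    ]
```

A possible use is to compute `wer_gap` on a small validation sample for a few candidate thresholds and keep the one with the smallest gap. That selection rule is an assumption made for illustration; the abstract states only that the threshold is estimated from a small validation sample.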

Paper Structure

This paper contains 31 sections, 1 equation, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Effect of denoising on the speech spectrogram and, in turn, on ASR transcription performance. The text highlighted in green is omitted from the ASR transcription when noise is present in the spectrogram. WHISPER was used to transcribe a speech sample from the TIE dataset, and the denoising strategy was DEMUCS followed by LE.
  • Figure 2: Overview of the $\mathcal{DENOASR}$ framework for debiasing ASRs.
  • Figure 3: (Top) Spectrogram of a speech sample with a low stoi_like score, having significant noise in the lower frequencies. (Bottom) Spectrogram of a speech sample with a high stoi_like score, having more noise in the higher frequencies.