Toward Fairness in Speech Recognition: Discovery and mitigation of performance disparities

Pranav Dheram, Murugesan Ramakrishnan, Anirudh Raju, I-Fan Chen, Brian King, Katherine Powell, Melissa Saboowala, Karan Shetty, Andreas Stolcke

TL;DR

This work tackles fairness in automatic speech recognition (ASR) in two steps: (1) identifying underperforming speaker cohorts, using geodemographic proxies and an automatic, label-free method based on speaker embeddings; and (2) mitigating disparities by oversampling bottom-cohort data with semi-supervised training and by incorporating a 2D cohort embedding into the acoustic model. The automatic speaker-embedding approach discovers larger disparities, and scales better, than the geodemographic methods. Both oversampling and cohort embeddings reduce the relative word error rate (WER) gap between top and bottom cohorts (from about 56% to roughly 40%), with minimal impact on top-cohort accuracy. Combining the two yields no additional gains, which is attributed to imperfect alignment between demographic labels and acoustic characteristics. The results demonstrate practical, production-scale improvements in ASR fairness and point to future work on interpretable automatic cohorts and on loss-based or adaptation-based mitigation strategies.

Abstract

As with other forms of AI, speech recognition has recently been examined with respect to performance disparities across different user cohorts. One approach to achieving fairness in speech recognition is to (1) identify speaker cohorts that suffer from subpar performance and (2) apply fairness mitigation measures targeting the cohorts discovered. In this paper, we report initial findings on both discovery and mitigation of performance disparities, using data from a product-scale AI assistant speech recognition system. We compare cohort discovery based on geographic and demographic information to a more scalable method that groups speakers without human labels, using speaker embedding technology. For fairness mitigation, we find that oversampling underrepresented cohorts, as well as modeling speaker cohort membership through additional input variables, reduces the gap between top- and bottom-performing cohorts without degrading overall recognition accuracy.
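
To make the notion of a gap between top- and bottom-performing cohorts concrete, the following is a minimal sketch of one way such a gap could be computed. It assumes the relative gap is defined as (WER_bottom - WER_top) / WER_top; the text above does not state the exact formula, so this definition is an assumption. The cohort names and error counts are made-up illustrative values, not results from the paper, and the function names are hypothetical.

```python
from typing import Dict, Tuple


def word_error_rate(errors: int, ref_words: int) -> float:
    """WER as word errors divided by the number of reference words."""
    if ref_words <= 0:
        raise ValueError("ref_words must be positive")
    return errors / ref_words


def relative_wer_gap(cohort_wer: Dict[str, float]) -> float:
    """Relative gap between the worst (bottom) and best (top) cohort.

    Assumed definition: (WER_bottom - WER_top) / WER_top.
    """
    top = min(cohort_wer.values())     # best-performing cohort has the lowest WER
    bottom = max(cohort_wer.values())  # worst-performing cohort has the highest WER
    if top <= 0:
        raise ValueError("top-cohort WER must be positive")
    return (bottom - top) / top


# Illustrative (errors, reference words) per cohort; invented for this sketch.
cohort_counts: Dict[str, Tuple[int, int]] = {
    "cohort_a": (100, 1000),
    "cohort_b": (150, 1000),
    "cohort_c": (120, 1000),
}

cohort_wer = {name: word_error_rate(e, n) for name, (e, n) in cohort_counts.items()}
gap = relative_wer_gap(cohort_wer)
print(f"relative WER gap: {gap:.1%}")
```

Under this assumed definition, a mitigation measure narrows the gap when it lowers the bottom cohort's WER faster than the top cohort's, which is consistent with the stated goal of leaving top-cohort accuracy largely unchanged.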

Paper Structure (18 sections, 3 equations, 5 tables)