FairLENS: Assessing Fairness in Law Enforcement Speech Recognition

Yicheng Wang, Mark Cusick, Mohamed Laila, Kate Puech, Zhengping Ji, Xia Hu, Michael Wilson, Noah Spitzer-Williams, Bryan Wheeler, Yasser Ibrahim

TL;DR

This paper addresses fairness gaps in automatic speech recognition (ASR) for law-enforcement use, where accuracy varies across demographic groups and acoustic conditions. It introduces FairLENS, an adaptable fairness evaluation framework that combines a word error rate (WER)-based disparity metric with a Wilcoxon signed-rank test, plus a new FairLENS dataset featuring self-identified demographics and diverse real-world scenarios. Applied to 1 open-source and 11 commercial ASR models, the analysis reveals heterogeneous fairness profiles and biases against groups such as Asian and African American speakers, teens, and speakers with Southern accents, with performance degradation amplified by acoustic domain shifts. The work provides a principled tool for model selection in safety-critical contexts and highlights the need for more diverse and balanced training data to reduce demographic biases.
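The two ingredients named in the TL;DR, WER and the Wilcoxon signed-rank test, can be illustrated with a short Python sketch. This summary does not spell out how FairLENS defines its disparity metric or which samples it pairs, so the pairing used here (per-group WERs of two models) is an assumption made purely for illustration, as are the function names `wer`, `signed_ranks`, and `wilcoxon_exact` and the example numbers. The p-value is computed by exact enumeration, which is only practical for a small number of pairs.

```python
from itertools import product


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def signed_ranks(diffs):
    """Rank |d| for nonzero differences (ties share the average rank)."""
    nonzero = [d for d in diffs if d != 0]
    order = sorted(range(len(nonzero)), key=lambda k: abs(nonzero[k]))
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]])):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, nonzero) if d < 0)
    return w_plus, w_minus, ranks


def wilcoxon_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank test on paired samples x, y."""
    # Rounding guards against float noise creating spurious non-ties.
    diffs = [round(a - b, 6) for a, b in zip(x, y)]
    w_plus, w_minus, ranks = signed_ranks(diffs)
    observed = min(w_plus, w_minus)
    total = sum(ranks)
    hits = 0
    # Under the null hypothesis every sign assignment is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(plus, total - plus) <= observed:
            hits += 1
    return observed, hits / 2 ** len(ranks)


# Hypothetical per-group WERs for two models (illustrative numbers only).
model_a = [0.10, 0.12, 0.15, 0.11, 0.20, 0.18]
model_b = [0.08, 0.09, 0.10, 0.10, 0.13, 0.12]
statistic, p_value = wilcoxon_exact(model_a, model_b)
```

A paired, rank-based test is a natural fit for this kind of comparison because per-group WERs are rarely normally distributed and both models are scored on the same groups.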

Abstract

Automatic speech recognition (ASR) techniques have become powerful tools, enhancing efficiency in law enforcement scenarios. To ensure fairness for demographic groups in different acoustic environments, ASR engines must be tested across a variety of speakers in realistic settings. However, describing the fairness discrepancies between models with confidence remains a challenge. Meanwhile, most public ASR datasets are insufficient for a satisfactory fairness evaluation. To address these limitations, we built FairLENS, a systematic fairness evaluation framework. We propose a novel and adaptable evaluation method to examine the fairness disparity between different models. We also collected a fairness evaluation dataset covering multiple scenarios and demographic dimensions. Leveraging this framework, we conducted fairness assessments on 1 open-source and 11 commercially available state-of-the-art ASR models. Our results reveal that certain models exhibit more biases than others, serving as a fairness guideline for users to make informed choices when selecting ASR models for a given real-world scenario. We further explored model biases towards specific demographic groups and observed that shifts in the acoustic domain can lead to the emergence of new biases.

Paper Structure

This paper contains 24 sections, 9 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Data case distributions of the demographic groups in the FairLENS dataset.
  • Figure 2: Comparing the FairLENS dataset with others.
  • Figure 3: Joint performance-fairness evaluation results.
  • Figure 4: The performance of ASR models across different demographic subgroups on the solo speaker transcription dataset. We show only those results that are worse than the average WER.
  • Figure 5: The performance of ASR models across different demographic groups on the indoor and outdoor dialogue datasets. The two dashed red lines in each graph represent the mean WER of the model on the indoor and outdoor dialogue datasets, respectively. The WER variations are shown on the bars.
  • ...and 4 more figures