FairLENS: Assessing Fairness in Law Enforcement Speech Recognition

Yicheng Wang; Mark Cusick; Mohamed Laila; Kate Puech; Zhengping Ji; Xia Hu; Michael Wilson; Noah Spitzer-Williams; Bryan Wheeler; Yasser Ibrahim

TL;DR

This paper addresses fairness gaps in automatic speech recognition (ASR) for law-enforcement use, where accuracy varies across demographic groups and acoustic conditions. It introduces FairLENS, an adaptable fairness evaluation framework that combines a disparity metric based on word error rate (WER) with a Wilcoxon signed-rank test, plus a new FairLENS dataset featuring self-identified demographics and diverse real-world scenarios. Applied to 1 open-source and 11 commercial ASR models, the analysis reveals heterogeneous fairness profiles and biases toward groups such as Asian and African American speakers, teens, and speakers with Southern accents, with performance degradation amplified by acoustic domain shifts. The work provides a principled tool for model selection in safety-critical contexts and highlights the need for more diverse and balanced training data to reduce demographic biases.

Abstract

Automatic speech recognition (ASR) techniques have become powerful tools, enhancing efficiency in law enforcement scenarios. To ensure fairness for demographic groups in different acoustic environments, ASR engines must be tested across a variety of speakers in realistic settings. However, describing the fairness discrepancies between models with confidence remains a challenge, and most public ASR datasets are insufficient for a satisfactory fairness evaluation. To address these limitations, we built FairLENS, a systematic fairness evaluation framework. We propose a novel and adaptable evaluation method to examine the fairness disparity between different models. We also collected a fairness evaluation dataset covering multiple scenarios and demographic dimensions. Leveraging this framework, we conducted fairness assessments on 1 open-source and 11 commercially available state-of-the-art ASR models. Our results reveal that certain models exhibit more biases than others, serving as a fairness guideline for users to make informed choices when selecting ASR models for a given real-world scenario. We further explored model biases towards specific demographic groups and observed that shifts in the acoustic domain can lead to the emergence of new biases.
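
As a rough illustration of the quantity the framework builds on, the sketch below computes word error rate (WER) per demographic group from reference and hypothesis transcripts. The function names, the sample data, and the max-minus-min gap are assumptions made for illustration only; the gap is a simplified stand-in and not the FairLENS disparity metric, and the Wilcoxon signed-rank test is not implemented here.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion (reference word missing)
                curr[j - 1] + 1,         # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis) triples."""
    errors, words = defaultdict(int), defaultdict(int)
    for group, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[group] += e
        words[group] += n
    # Pool errors over all utterances in a group before dividing.
    return {g: errors[g] / words[g] for g in errors if words[g] > 0}


def wer_gap(group_wer):
    """Hypothetical summary: spread between the worst and best group WER."""
    return max(group_wer.values()) - min(group_wer.values())


# Hypothetical transcripts for two illustrative groups.
samples = [
    ("group_a", "pull over the vehicle", "pull over the vehicle"),
    ("group_b", "pull over the vehicle", "pull over vehicle"),
]
group_wer = wer_by_group(samples)
print(group_wer, wer_gap(group_wer))
```

Pooling errors and word counts within a group before dividing keeps short utterances from dominating a group's score; averaging per-utterance WER instead would be an equally reasonable choice under different assumptions.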

Paper Structure (24 sections, 9 equations, 9 figures, 2 tables)

Introduction
Related Work
Fairness Evaluation on ASR Models
Datasets for Fairness Evaluation
FairLENS: A Fairness Evaluation Framework
FairLENS Evaluation Method
FairLENS Dataset
Fairness Evaluation Results
Joint Performance-Fairness Evaluation
Biases toward Specific Demographic Groups
Acoustic Domain Shift
Conclusion & Future Work
Wilcoxon Signed-Rank Test
Adaptation to Another Fairness Evaluation Task
Exemplar Scripts
...and 9 more sections

Figures (9)

Figure 1: Data case distributions of the demographic groups in the FairLENS dataset.
Figure 2: Comparing the FairLENS dataset with others.
Figure 3: Joint performance-fairness evaluation results.
Figure 4: The performance of ASR models across different demographic subgroups on the solo speaker transcription dataset. We show only the results that are worse than the average WER.
Figure 5: The performance of ASR models across different demographic groups on the indoor and outdoor dialogue datasets. The two dashed red lines in each graph represent the mean WER of the model on the indoor and outdoor dialogue datasets, respectively. The WER variations are shown on the bars.
...and 4 more figures
