Beyond Word Error Rate: Auditing the Diversity Tax in Speech Recognition through Dataset Cartography

Ting-Hui Cheng, Line H. Clemmensen, Sneha Das

TL;DR

The sample difficulty index (SDI), a novel metric that quantifies how intrinsic demographic and acoustic factors drive model failure, is introduced to enable rigorous model auditing and to expose hidden systemic biases and inter-model disagreements that WER ignores.

Abstract

Automatic speech recognition (ASR) systems are predominantly evaluated using the Word Error Rate (WER). However, raw token-level metrics fail to capture semantic fidelity and routinely obscure the "diversity tax": the disproportionate burden that systematic recognition failures place on marginalized and atypical speakers. In this paper, we explore the limitations of relying solely on lexical counts by systematically evaluating a broader class of non-linear and semantic metrics. To enable rigorous model auditing, we introduce the sample difficulty index (SDI), a novel metric that quantifies how intrinsic demographic and acoustic factors drive model failure. By mapping SDI onto data cartography, we demonstrate that the EmbER and SemDist metrics expose hidden systemic biases and inter-model disagreements that WER ignores. Finally, our findings are a first step towards a robust audit framework for prospective safety analysis, empowering developers to audit and mitigate ASR disparities prior to deployment.
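To make the point about lexical counts concrete, the following is a minimal sketch of the standard WER definition (word-level edit distance divided by the number of reference words). It is not the paper's evaluation code: the function name `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions chosen for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are obtained by splitting on whitespace, with no
    text normalization (casing, punctuation) applied beforehand.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Two hypothetical transcripts with one substitution each: the first
# reverses the meaning, the second barely changes it, yet WER is equal.
reference = "do not take the pill"
wer_meaning_changed = word_error_rate(reference, "do now take the pill")
wer_meaning_kept = word_error_rate(reference, "do not take a pill")
```

Both hypotheses receive the same score because WER counts edits without weighting their effect on meaning, which is the gap that semantic metrics are meant to address.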

Paper Structure (9 sections, 3 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Characteristics of datasets as averages or ratios.
  • Figure 2: A) Mean and standard error of the $\beta, \alpha$ coefficients from the fixed effect (FE) model for the demographic and acoustic characteristics; B) Mean FE coefficients for models and datasets; C) Summary of the FE model fit statistics across the six metrics.
  • Figure 3: Latent space mapping of ASR performance using Principal Component Analysis. Axes are truncated to highlight the primary variance clusters.
  • Figure 4: Cartography plots mapping mean error ($\mu$) against inter-model disagreement ($\sigma$), colored by SDI decile. SDI deciles divide the dataset's speech samples into ten equal tiers by calculated intrinsic difficulty, from 1 (the easiest samples for models to transcribe) to 10 (the hardest).
  • Figure 5: Cartography plots of mean error ($\mu$) against inter-model disagreement ($\sigma$) across ASR systems, using the EmbER metric. $\mu$ reflects the overall recognition difficulty of each sample, while $\sigma$ captures the extent of ambiguity, indicating how consistently different ASR models perform on the same utterance.
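The cartography coordinates described in the captions for Figures 4 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the `errors` matrix layout (one row per sample, one column per ASR model), the use of the population standard deviation, and the rank-based decile assignment are all choices made here. The SDI itself is not defined on this page, so `difficulty_scores` is treated as a precomputed input.

```python
import numpy as np


def cartography_coordinates(errors):
    """Return per-sample mean error (mu) and inter-model disagreement (sigma).

    errors: array of shape (n_samples, n_models) holding one error score
    per sample and ASR model (for example an EmbER or WER value).
    Assumption: sigma is the population standard deviation (ddof=0).
    """
    errors = np.asarray(errors, dtype=float)
    mu = errors.mean(axis=1)
    sigma = errors.std(axis=1)
    return mu, sigma


def decile_labels(difficulty_scores):
    """Assign each sample a tier from 1 (easiest) to 10 (hardest).

    Samples are ranked by score and split into ten near-equal tiers.
    Assumption: ties are broken by input order via a stable sort.
    """
    scores = np.asarray(difficulty_scores, dtype=float)
    ranks = scores.argsort(kind="stable").argsort(kind="stable")  # 0 .. n-1
    return ranks * 10 // len(scores) + 1


# Hypothetical toy data: 2 samples scored by 3 models.
toy_errors = [
    [0.1, 0.1, 0.1],  # all models agree: low mu, zero sigma
    [0.0, 0.5, 1.0],  # models disagree: higher mu, high sigma
]
toy_mu, toy_sigma = cartography_coordinates(toy_errors)
```

Plotting `mu` against `sigma` and coloring each point by `decile_labels(...)` gives a scatter in the spirit of the cartography figures; samples with high $\sigma$ are those on which the models disagree most.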