"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most

Kaitlyn Zhou; Martijn Bartelds; Federico Bianchi; James Zou

TL;DR

This work exposes a gap between benchmark word error rate and real-world reliability in speech systems by studying street-name transcription, a high-stakes named-entity task. It evaluates 15 deployed models on recordings from linguistically diverse U.S. speakers and finds a 44% average transcription error rate, with significant downstream costs in ride-hailing contexts. To address this, the authors propose a scalable synthetic-data augmentation pipeline that uses open-source TTS to generate diverse pronunciations of named entities, achieving a relative improvement of nearly 60% for non-English primary speakers with fewer than 1,000 synthetic samples. They release two public datasets (SF Streets and US Streets) and demonstrate substantial cross-language benefits, including improvements on out-of-distribution voices and street names. The work calls for explicit evaluation of named entities in deployment settings and offers a practical path toward reducing high-stakes transcription errors.

Abstract

Although speech recognition systems achieve low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic location and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers as for English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with fewer than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.
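
To make the two headline quantities concrete, the sketch below shows one way an utterance-level transcription error rate and a relative accuracy improvement could be computed. It is an illustration under stated assumptions, not the authors' implementation: the normalization rule (lowercasing, stripping punctuation), the exact-match criterion, the function names, the sample transcripts, and the accuracy values are all hypothetical, and the paper's own metric definition may differ.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace (assumed normalization)."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())


def transcription_error_rate(references, hypotheses):
    """Fraction of utterances whose normalized transcript differs from the reference."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        raise ValueError("need at least one utterance")
    errors = sum(normalize(r) != normalize(h) for r, h in zip(references, hypotheses))
    return errors / len(references)


def relative_improvement(base_accuracy, tuned_accuracy):
    """Accuracy gain expressed as a fraction of the base model's accuracy."""
    if base_accuracy <= 0:
        raise ValueError("base_accuracy must be positive")
    return (tuned_accuracy - base_accuracy) / base_accuracy


# Illustrative transcripts only; these are not drawn from the released datasets.
refs = ["Guerrero Street", "Noe Street", "Market Street", "Geary Boulevard"]
hyps = ["Guerrero Street", "Noah Street", "market street.", "Gary Boulevard"]
rate = transcription_error_rate(refs, hyps)

# Made-up accuracies chosen for easy arithmetic; not results from the paper.
gain = relative_improvement(0.40, 0.64)
```

Under these assumptions a single substituted word in a short utterance counts as a full error, which is why an utterance-level rate on street names can be far higher than a corpus-level word error rate. The relative-improvement helper also shows why a gain of nearly 60% relative to the base model is not the same as a 60-point increase in accuracy.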

Paper Structure (24 sections, 12 figures, 8 tables)

Introduction
Background and Related Work
Dataset
SF Streets Dataset
Participants
U.S. Streets Dataset
Metrics: Transcription Error Rate
Street Name Recognition Is Challenging
Adding Context
Implications for Speech Model Evaluation
Exacerbated Errors for Non-English Primary Speakers
Finding
Estimating the Financial Impact
Mitigation via Synthetic Data
Failed Initial Attempt
...and 9 more sections

Figures (12)

Figure 1: Overview of Transcription Evaluation Pipeline
Figure 2: Limited English Proficiency Speakers in San Francisco. Original data from SF_LanguageDiversityData_SFgov.
Figure 3: Overall Transcription Accuracy on SF Streets for Models That Accept a Prompt
Figure 4: Transcription Accuracy by Language Groups Across All Model Families. 95% confidence intervals calculated via bootstrap resampling of 10,000 samples
Figure 5: Visualization of the Five Worst Mistakes (by distance) of a Non-English Speaker
...and 7 more figures