"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most

Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou

TL;DR

This work exposes a gap between benchmark word error rate and real-world reliability in speech systems by studying street-name transcription, a high-stakes named-entity task. It evaluates 15 deployed models on recordings from linguistically diverse U.S. speakers, finding a $44\%$ average transcription error rate and significant downstream costs in ride-hailing contexts. To address this, the authors propose a scalable synthetic-data augmentation pipeline that uses open-source TTS to generate diverse pronunciations of named entities, achieving a relative improvement of nearly $60\%$ for non-English primary speakers with fewer than $1{,}000$ synthetic samples. They release two public datasets (SF Streets and US Streets) and demonstrate substantial cross-language benefits, including improvements on out-of-distribution voices and street names. The work calls for explicit evaluation of named entities in deployment settings and offers a practical path toward reducing high-stakes transcription errors.

Abstract

Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with fewer than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.
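
The abstract reports its fine-tuning gain as a relative improvement over base models. As a minimal sketch of how such a figure is commonly computed, assuming "relative" means the accuracy gain divided by the base accuracy (the abstract does not spell out the formula), the arithmetic looks like this. The function name `relative_improvement` and both accuracy values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_improvement(base_accuracy: float, tuned_accuracy: float) -> float:
    """Return the fractional accuracy gain of a tuned model over its base model.

    Assumed definition: (tuned - base) / base.
    """
    if base_accuracy <= 0:
        raise ValueError("base_accuracy must be positive")
    return (tuned_accuracy - base_accuracy) / base_accuracy


# Hypothetical accuracies for illustration only, NOT figures from the paper.
base, tuned = 0.40, 0.64
gain = relative_improvement(base, tuned)
print(f"absolute gain: {tuned - base:.2f}, relative gain: {gain:.0%}")
```

Under this definition a relative gain depends on the starting point: the same absolute gain yields a larger relative improvement when the base accuracy is low.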

Paper Structure (24 sections, 12 figures, 8 tables)

Figures (12)

  • Figure 1: Overview of Transcription Evaluation Pipeline
  • Figure 2: Limited English Proficiency Speakers in San Francisco. Original data from SF_LanguageDiversityData_SFgov.
  • Figure 3: Overall Transcription Accuracy on SF Streets for Models That Accept a Prompt
  • Figure 4: Transcription Accuracy by Language Groups Across All Model Families. 95% confidence intervals calculated via bootstrap resampling of 10,000 samples
  • Figure 5: Visualization of the Five Worst Mistakes (by distance) of a Non-English Speaker
  • ...and 7 more figures
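
The caption of Figure 4 mentions 95% confidence intervals obtained by bootstrap resampling. A minimal sketch of a percentile bootstrap for an accuracy estimate follows, assuming per-utterance outcomes coded as 1 (correct) or 0 (incorrect) and reading "10,000 samples" as 10,000 resamples. The function `bootstrap_ci`, its parameters, and the example data are hypothetical; the paper's exact procedure may differ.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `outcomes`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Resample with replacement, recompute the mean each time, then sort.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of the means.
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical outcomes for illustration only: 60 correct, 40 incorrect.
example = [1] * 60 + [0] * 40
low, high = bootstrap_ci(example)
print(f"accuracy = {sum(example) / len(example):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

The percentile method is one of several bootstrap interval constructions; it is used here only because it is the simplest to state.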