"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most
Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou
TL;DR
This work exposes a gap between benchmark word error rate and real-world reliability in speech systems by studying street-name transcription, a high-stakes named-entity task. It evaluates 15 deployed models on recordings from linguistically diverse U.S. speakers, finding a 44% average transcription error rate and significant downstream costs in ride-hailing contexts. To address this, the authors propose a scalable synthetic-data augmentation pipeline that uses open-source TTS to generate diverse pronunciations of named entities, achieving a relative improvement of nearly 60% for non-English primary speakers with fewer than 1,000 synthetic samples. They release two public datasets (SF Streets and US Streets) and demonstrate substantial cross-language benefits, including improvements on out-of-distribution voices and street names. The work calls for explicit evaluation of named entities in deployment settings and offers a practical path toward reducing high-stakes transcription errors.
Abstract
Although speech recognition systems achieve low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in one such task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions in terms of geographic location and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers as for English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with fewer than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.
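To make the two headline quantities concrete, the following is a minimal sketch of how an utterance-level transcription error rate and a relative accuracy improvement could be computed. It is not the authors' evaluation code. The function names, the normalization rule (lowercasing and stripping punctuation), the exact-match criterion, and all example strings and numbers are assumptions chosen for illustration only.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed rule)."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())


def error_rate(references: list[str], hypotheses: list[str]) -> float:
    """Fraction of utterances whose normalized transcript differs from the reference."""
    if not references or len(references) != len(hypotheses):
        raise ValueError("need two non-empty lists of equal length")
    misses = sum(normalize(r) != normalize(h) for r, h in zip(references, hypotheses))
    return misses / len(references)


def relative_improvement(base_accuracy: float, tuned_accuracy: float) -> float:
    """Accuracy gain expressed as a fraction of the base model's accuracy."""
    if base_accuracy <= 0:
        raise ValueError("base accuracy must be positive")
    return (tuned_accuracy - base_accuracy) / base_accuracy


# Hypothetical transcripts, not drawn from the released datasets.
refs = ["Guerrero Street", "Noe Street", "Divisadero Street", "Geary Boulevard"]
hyps = ["Guerrero Street.", "Noway Street", "divisadero street", "Gary Boulevard"]

rate = error_rate(refs, hyps)            # two of the four utterances are wrong
gain = relative_improvement(0.50, 0.75)  # hypothetical accuracies

print(f"error rate: {rate:.0%}, relative improvement: {gain:.0%}")
```

Under this definition, a relative improvement is measured against the base model's accuracy rather than in absolute percentage points, so moving from a hypothetical 50% to 75% accuracy counts as a 50% relative gain.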
