Earnings-21: A Practical Benchmark for ASR in the Wild
Miguel Del Rio, Natalie Delworth, Ryan Westerman, Michelle Huang, Nishchal Bhandari, Joseph Palakapilly, Quinten McNamara, Joshua Dong, Piotr Zelasko, Miguel Jette
TL;DR
Earnings-21 provides a realistic, open benchmark for ASR in the wild, focusing on domain-specific and entity-rich earnings-call audio. The paper introduces fstalign to produce flexible WER computations that account for text normalization and semantic equivalence, and it analyzes performance across four commercial systems, two internal Kaldi/ESPnet models, and a LibriSpeech baseline. Key findings include significant variance by sector and named entity class, with DATE/TIME/ORDINAL easiest and FAC/ORG/PERSON hardest, and higher sampling rates generally improving transcription accuracy. The release of Earnings-21 and fstalign aims to bridge academic and industry evaluation and to drive research on robust, domain-aware ASR and NER.
Abstract
Commonly used speech corpora inadequately challenge academic and commercial ASR systems. In particular, speech corpora lack metadata needed for detailed analysis and WER measurement. In response, we present Earnings-21, a 39-hour corpus of earnings calls containing entity-dense speech from nine different financial sectors. This corpus is intended to benchmark ASR systems in the wild with special attention towards named entity recognition. We benchmark four commercial ASR models, two internal models built with open-source tools, and an open-source LibriSpeech model, and discuss their differences in performance on Earnings-21. Using our recently released fstalign tool, we provide a candid analysis of each model's recognition capabilities under different partitions. Our analysis finds that ASR accuracy for certain NER categories is poor, presenting a significant impediment to transcript comprehension and usage. Earnings-21 bridges academic and commercial ASR system evaluation and enables further research on entity modeling and WER on real-world audio.
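For readers unfamiliar with the metric, the following is a minimal Python sketch of word error rate (WER) with a simple normalization step. It is not fstalign's implementation or API: it uses a plain word-level Levenshtein distance, and the function names `normalize` and `word_error_rate`, as well as the `synonyms` mapping, are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional


def normalize(text: str, synonyms: Optional[Dict[str, str]] = None) -> List[str]:
    """Lowercase, strip edge punctuation, and map synonym forms to one spelling.

    The `synonyms` dictionary is an illustrative stand-in for the kind of
    text-normalization equivalences a scoring tool might accept.
    """
    synonyms = synonyms or {}
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(".,;:!?\"")
        if token:
            tokens.append(synonyms.get(token, token))
    return tokens


def word_error_rate(
    reference: str,
    hypothesis: str,
    synonyms: Optional[Dict[str, str]] = None,
) -> float:
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = normalize(reference, synonyms)
    hyp = normalize(hypothesis, synonyms)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Under these assumptions, `word_error_rate("revenue grew four percent", "revenue grew 4 percent")` counts one substitution in four reference words, giving 0.25, while passing `synonyms={"4": "four"}` treats the two forms as equivalent and gives 0.0. This illustrates why normalization choices matter when comparing systems on entity-dense speech.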
