Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual Speech Recognition Evaluation

Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Adel Moumen, Sanchit Gandhi

TL;DR

Open ASR Leaderboard tackles the lack of reproducibility and multilingual coverage in ASR evaluation by offering a fully reproducible benchmark across 60+ models and 11 datasets, with standardized text normalization and two reported metrics: word error rate (WER) and inverse real-time factor (RTFx). The datasets are exposed via Hugging Face, and the codebase is open-source so that new models and data can be added easily. Key findings: Conformer encoders paired with LLM decoders achieve the best English WER but are slower; CTC and TDT decoders are far more efficient at some cost in accuracy; and Whisper-derived encoders fine-tuned for English improve English accuracy at the expense of multilingual breadth. The results highlight trade-offs between specialization and coverage, and emphasize the value of open, transparent evaluation for guiding deployment decisions.

Abstract

Despite rapid progress, ASR evaluation remains saturated with short-form English, and efficiency is rarely reported. We present the Open ASR Leaderboard, a fully reproducible benchmark and interactive leaderboard comparing 60+ open-source and proprietary systems across 11 datasets, including a dedicated multilingual track. We standardize text normalization and report both word error rate (WER) and inverse real-time factor (RTFx), enabling fair accuracy-efficiency comparisons. For English transcription, Conformer encoders paired with LLM decoders achieve the best average WER but are slower, while CTC and TDT decoders deliver much better RTFx, making them attractive for long-form and offline use. Whisper-derived encoders fine-tuned for English improve accuracy but often trade off multilingual coverage. All code and dataset loaders are open-sourced to support transparent, extensible evaluation.
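To make the two reported metrics concrete, the following is a minimal sketch of how WER and RTFx can be computed. It is not the leaderboard's implementation: the function names are hypothetical, the text is assumed to be already normalized and split on whitespace, and WER is computed per utterance here rather than aggregated over a dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumes both strings are already normalized (hypothetical preprocessing).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one row of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def inverse_real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """RTFx = audio duration / processing time; higher means faster."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    rtfx = inverse_real_time_factor(audio_seconds=600.0, processing_seconds=6.0)
    print(f"WER = {wer:.3f}, RTFx = {rtfx:.1f}")
```

In this example the hypothesis contains one substitution and one deletion against a six-word reference, and ten minutes of audio are processed in six seconds. Lower WER is better, while higher RTFx is better, which is why the two are reported together.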

Paper Structure

This paper contains 9 sections, 1 equation, 4 tables.