Anatomy of Industrial Scale Multilingual ASR
Francis McCann Ramirez, Luka Chkhetiani, Andrew Ehrenberg, Robert McHardy, Rami Botros, Yash Khare, Andrea Vanzo, Taufiquzzaman Peyash, Gabriel Oexle, Michael Liang, Ilya Sklyar, Enver Fakhan, Ahmed Etefy, Daniel McCrystal, Sam Flamini, Domenic Donato, Takuya Yoshioka
TL;DR
The paper presents Universal-1, AssemblyAI's industrial-scale multilingual ASR system: a 600M-parameter full-context Conformer encoder with an RNN-T decoder, pre-trained with BEST-RQ on 12.5M hours of unlabeled audio and fine-tuned on roughly 1.79M hours of labeled data (188k hours supervised plus 1.6M hours pseudo-labeled) across English, Spanish, German, and French. It reports WERs competitive with much larger models (Whisper large, Canary-1B), together with better code-switching robustness, a 5x inference speedup over an optimized Whisper baseline, fewer hallucinations on both speech and ambient noise, and more accurate timestamps. The work takes a system-centric, data-driven view of real-world ASR deployment, examining how data composition, architectural choices, and inference strategies matter for high-throughput services operating at scale.
Abstract
This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data and a 90% reduction in hallucination rate on ambient noise compared to Whisper, along with significantly improved timestamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.
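Since the headline comparison is stated in terms of word error rate, a minimal sketch of how WER is conventionally computed may help: the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the paper's own scoring pipeline, including any text normalization applied before scoring, is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokens are split on whitespace and compared
    exactly, with no casing or punctuation normalization (an assumption).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference, which is one reason hallucinated output on non-speech audio is usually measured separately from WER.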
