Anatomy of Industrial Scale Multilingual ASR

Francis McCann Ramirez, Luka Chkhetiani, Andrew Ehrenberg, Robert McHardy, Rami Botros, Yash Khare, Andrea Vanzo, Taufiquzzaman Peyash, Gabriel Oexle, Michael Liang, Ilya Sklyar, Enver Fakhan, Ahmed Etefy, Daniel McCrystal, Sam Flamini, Domenic Donato, Takuya Yoshioka

TL;DR

The paper presents Universal-1, AssemblyAI's industrial-scale multilingual ASR system: a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ on 12.5M hours of unsupervised data, paired with an RNN-T decoder and fine-tuned jointly on about 1.79M hours of supervised (188k hours) and pseudo-labeled (1.6M hours) data across English, Spanish, German, and French. It reports competitive WER against larger and more computationally expensive models (Whisper large, Canary-1B), together with improved code-switching, a 5x inference speedup over an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, a 90% reduction in ambient noise compared to Whisper, and significantly improved timestamp accuracy. The work takes a system-centric approach, analyzing data composition, architectural choices, and inference strategies to draw insights relevant to real-world ASR services operating at scale.

Abstract

This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, and a 90% reduction in ambient noise compared to Whisper, along with significantly improved time-stamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.

Paper Structure (40 sections, 8 figures, 7 tables)

This paper contains 40 sections, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Two-stage training procedure comprising self-supervised pre-training based on BEST-RQ followed by RNN-T fine-tuning.
  • Figure 2: Overview of the changes made for the sequential transducer loss. Encoder outputs are unrolled over the time ($T$) dimension, which prevents the creation of a high-memory lattice of shape $B \times V \times T \times U$.
  • Figure 3: Code-switching experiment results using synthetic datasets, comparing our model and open-source models. The tags in the legend correspond to different ways of configuring the open-source models. ALD: Whisper's automatic language detection was used to predict the language token for each sample. EN: English was specified. ML: The non-English language of each dataset was specified.
  • Figure 4: Occurrences of five or more consecutive errors of each type per hour for Universal-1, Canary-1B and Whisper large-v3 models.
  • Figure 5: Relative reduction in occurrences of $N$ or more consecutive errors of each type per hour for Universal-1 compared with Whisper large-v3 and Canary-1B, for different values of $N$.
  • ...and 3 more figures
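
The metric in the captions of Figures 4 and 5 (occurrences of $N$ or more consecutive errors of each type per hour, and the relative reduction against a baseline) can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes each utterance has already been aligned to its reference at the word level, that the alignment is given as a sequence of hypothetical labels "C" (correct), "S" (substitution), "D" (deletion), and "I" (insertion), and that a "consecutive error" is a maximal run of one error type. The paper's exact definition may differ.

```python
from collections import Counter
from itertools import groupby

# Hypothetical alignment labels: substitution, deletion, insertion.
ERROR_TYPES = ("S", "D", "I")


def count_error_runs(ops, min_run=5):
    """Count maximal runs of at least `min_run` identical error labels."""
    counts = Counter()
    for op, group in groupby(ops):
        # groupby yields one group per maximal stretch of equal labels.
        if op in ERROR_TYPES and sum(1 for _ in group) >= min_run:
            counts[op] += 1
    return counts


def runs_per_hour(utterances, min_run=5):
    """Aggregate run counts over (ops, duration_seconds) pairs.

    Returns occurrences per hour of audio for each error type.
    Assumes the total duration is greater than zero.
    """
    totals = Counter()
    total_seconds = 0.0
    for ops, duration_seconds in utterances:
        totals.update(count_error_runs(ops, min_run))
        total_seconds += duration_seconds
    hours = total_seconds / 3600.0
    return {t: totals[t] / hours for t in ERROR_TYPES}


def relative_reduction(baseline_rate, model_rate):
    """Fraction by which `model_rate` is lower than a nonzero `baseline_rate`."""
    return (baseline_rate - model_rate) / baseline_rate


# Toy example: one 30-minute utterance with a 5-word substitution run,
# a 6-word deletion run, and a 2-word insertion run (too short to count).
example_ops = list("CCSSSSSCDDDDDDCIIC")
rates = runs_per_hour([(example_ops, 1800.0)], min_run=5)
```

Sweeping `min_run` over several values and applying `relative_reduction` to each pair of per-hour rates would give curves of the kind the Figure 5 caption describes, under the same assumptions.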