OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models

William Chen, Jinchuan Tian, Yifan Peng, Brian Yan, Chao-Han Huck Yang, Shinji Watanabe

TL;DR

OWLS addresses the challenge of understanding scaling in multilingual speech by introducing a publicly available suite of 13 Whisper-style ASR/ST models spanning 0.25B–18B parameters trained on up to 360K hours across 150 languages. The authors formalize and validate neural scaling laws for speech, showing that downstream WER/CER and BLEU scores can be predicted from model size, data, and compute, and that larger models substantially reduce errors for low-resource languages. Beyond static performance, OWLS reveals emergent abilities in large speech models, including orthographic understanding, improved code-switching handling, and semantically coherent mondegreens, as well as non-trivial in-context learning capabilities for unseen languages. The work provides a reproducible, open framework for probing scaling effects in multilingual speech, with practical implications for fairness, accessibility, and future research into large-scale speech systems and their emergent behaviors.

Abstract

Neural scaling laws offer valuable insights for designing robust sequence processing architectures. While these laws have been extensively characterized in other modalities, their behavior in speech remains comparatively underexplored. In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, with the 18B version being the largest speech model, to the best of our knowledge. OWLS leverages up to 360K hours of public speech data across 150 languages, enabling a systematic investigation into how data, model, and compute scaling each influence performance in multilingual speech tasks. We use OWLS to derive neural scaling laws, showing how final performance can be reliably predicted when scaling. One of our key findings is that scaling enhances performance on low-resource languages/dialects, helping to mitigate bias and improve the accessibility of speech technologies. Finally, we show how OWLS can be used to power new research directions by discovering emergent abilities in large-scale speech models. Model checkpoints will be released on https://huggingface.co/collections/espnet/owls-scaling-laws-for-speech-recognition-and-translation-67ab7f991c194065f057ce8d for future studies.
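As a rough illustration of what "predicting final performance when scaling" can look like in practice, the sketch below fits a plain power law, error ≈ a · N^(−b), to error rates measured at several model sizes and then extrapolates to a larger size. The functional form (no irreducible-error offset), the helper names `fit_power_law` and `predict_error`, and all numbers are assumptions made for illustration; they are synthetic and are not taken from the paper.

```python
import numpy as np


def fit_power_law(sizes, errors):
    """Fit error ~= a * size**(-b) by least squares in log-log space.

    Returns (a, b, r2), where r2 is the coefficient of determination
    of the linear fit in log-log coordinates.
    """
    log_n = np.log(np.asarray(sizes, dtype=float))
    log_e = np.log(np.asarray(errors, dtype=float))
    # A power law is a straight line in log-log space:
    # log(error) = log(a) - b * log(size)
    slope, intercept = np.polyfit(log_n, log_e, 1)
    pred = slope * log_n + intercept
    ss_res = np.sum((log_e - pred) ** 2)
    ss_tot = np.sum((log_e - log_e.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return float(np.exp(intercept)), float(-slope), float(r2)


def predict_error(a, b, size):
    """Evaluate the fitted power law at a new model size."""
    return a * size ** (-b)


# Synthetic, noise-free data for illustration only (NOT results from the paper).
# Sizes are in billions of parameters; the values are hypothetical.
sizes = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 9.0])
true_a, true_b = 20.0, 0.3
errors = true_a * sizes ** (-true_b)

a, b, r2 = fit_power_law(sizes, errors)
extrapolated = predict_error(a, b, 18.0)
print(f"a={a:.3f}, b={b:.3f}, R^2={r2:.4f}, predicted error at 18B={extrapolated:.3f}")
```

Fitting in log-log space keeps the example to a single linear regression. With real, noisy measurements, a form with an additive irreducible-error term fitted by nonlinear least squares may describe the data better.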

Paper Structure

This paper contains 26 sections, 2 equations, 13 figures, 14 tables.

Figures (13)

  • Figure 1: Comparison of previous open models and our OWLS models (blue) by parameter count and training dataset size. Whisper and Canary are trained on undisclosed data, while OWSM and the presented OWLS use public data.
  • Figure 2: The effect of scaling model size on the 102 FLEURS languages, plotted as WER (or CER) versus available training data. Although WER/CER generally decreases with more training data, the relationship is only moderately correlated, as indicated by the R² values in the legend. Model performance is also influenced by domain alignment and orthographic transparency: for instance, more transparent languages (e.g., Spanish, Italian) often achieve lower error rates with less data than opaque languages (e.g., English, French).
  • Figure 3: The effect of model scaling on WER/CER on FLEURS. Languages are color-coded by the amount of training data. For readability, we only show the top-20 languages (by data amount) in our training corpus. We find that model scaling is consistently predictive of downstream WER/CER across languages. Scaling curves for other languages can be found in the Appendix.
  • Figure 4: WERs on multi-domain English ASR by model size.
  • Figure 5: The evolution of FLEURS WER/CER for the top 20 languages by data size, as more training data is added for each language and given a fixed model capacity. Left: impact on WER/CER when scaling from 11K to 180K total hours, when all data is from the same distribution. Right: impact on WER/CER from adding in data from a new domain/distribution (YODAS), when further scaling from 180K to 360K total hours. Plots for more languages can be found in the Appendix.
  • ...and 8 more figures