
Speech Robust Bench: A Robustness Benchmark For Speech Recognition

Muhammad A. Shah, David Solans Noguero, Mikko A. Heikkila, Bhiksha Raj, Nicolas Kourtellis

TL;DR

Speech Robust Bench (SRB) introduces a robustness benchmark for automatic speech recognition by aggregating $114$ diverse perturbations across four severity levels. The paper evaluates six contemporary ASR models, including Whisper, wav2vec2, and HuBERT variants, using the LibriSpeech clean test subset and a Spanish MLS-based dataset, and introduces a comprehensive metric suite comprising $ER$, $ERD$, $CE$, $CED$, and $EDV$ to capture utility, degradation, and stability. Key findings show that larger models and certain training strategies tend to improve robustness, but notable disparities persist across language and speaker demographics (e.g., English vs. Spanish, male vs. female). The work demonstrates that CE/CED provide nuanced, difficulty-aware comparisons of robustness, revealing that larger parameter counts do not universally guarantee robustness across all corruption types. Overall, SRB offers a rigorous, reproducible framework for diagnosing and guiding the development of robust, fair ASR systems in real-world deployments.

Abstract

As Automatic Speech Recognition (ASR) models become ever more pervasive, it is important to ensure that they make reliable predictions under corruptions present in the physical and digital world. We propose Speech Robust Bench (SRB), a comprehensive benchmark for evaluating the robustness of ASR models to diverse corruptions. SRB is composed of 114 input perturbations that simulate a heterogeneous range of corruptions that ASR models may encounter when deployed in the wild. We use SRB to evaluate the robustness of several state-of-the-art ASR models and observe that model size and certain modeling choices, such as the use of discrete representations or self-training, appear to be conducive to robustness. We extend this analysis to measure the robustness of ASR models on data from various demographic subgroups, namely English and Spanish speakers, and males and females. Our results reveal noticeable disparities in the models' robustness across subgroups. We believe that SRB will significantly facilitate future research towards robust ASR models by making it easier to conduct comprehensive and comparable robustness evaluations.
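As a concrete illustration of what one input perturbation can look like, the sketch below adds white Gaussian noise to a waveform at a target signal-to-noise ratio, the kind of corruption named in the Figure 3 caption (Gaussian noise at 10 dB SNR). It is a minimal sketch under stated assumptions, not SRB's implementation: the function names `add_gaussian_noise` and `measured_snr_db`, the synthetic 440 Hz test tone, and the 16 kHz sample rate are all hypothetical choices made for this example, and the abstract does not say how SRB maps severity levels to SNR values.

```python
import math
import random


def add_gaussian_noise(waveform, snr_db, rng):
    """Return a copy of `waveform` with white Gaussian noise at `snr_db` dB SNR.

    Hypothetical helper for illustration; not taken from the SRB codebase.
    """
    # Average power of the clean signal.
    signal_power = sum(x * x for x in waveform) / len(waveform)
    # SNR(dB) = 10 * log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in waveform]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Assumed test input: one second of a 440 Hz tone sampled at 16 kHz.
rng = random.Random(0)
sample_rate = 16000
clean = [0.5 * math.sin(2 * math.pi * 440 * i / sample_rate)
         for i in range(sample_rate)]

noisy = add_gaussian_noise(clean, 10.0, rng)
print(f"target SNR: 10.0 dB, measured SNR: {measured_snr_db(clean, noisy):.2f} dB")
```

Lowering `snr_db` makes the corruption more severe. A benchmark with several severity levels could sweep this parameter, although the specific values SRB uses are not given here.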

Paper Structure (33 sections, 8 equations, 17 figures, 13 tables)

This paper contains 33 sections, 8 equations, 17 figures, 13 tables.

Figures (17)

  • Figure 1: CCE and CCED as the severity of the augmentation is increased.
  • Figure 2: CCE and CCED of the models on different augmentations, averaged over all severities.
  • Figure 3: The Character ERD and EDV of various models under Gaussian noise at an SNR of 10 dB.
  • Figure 4: Averaged CCE and CCED plotted against the year of release.
  • Figure 5: Averaged CCE and CCED plotted against the number of model parameters (in millions).
  • ...and 12 more figures