MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models

Wen-Chin Huang, Erica Cooper, Tomoki Toda

TL;DR

The paper tackles the generalization gap in subjective speech quality assessment by building MOS-Bench, a diverse multi-dataset benchmark, and introducing the SHEET toolkit for reproducible SSQA experiments. It proposes new cross-dataset metrics (best score difference and best score ratio) and demonstrates that training on multiple datasets improves out-of-domain generalization while preserving in-domain accuracy. Through latent-space visualizations, it provides a diagnostic view of how models cover various data distributions and explains generalization patterns. An intriguing finding is that non-synthetic datasets (e.g., NISQA, PSTN) can generalize well to synthetic domains, suggesting a potential shift in data collection strategies. Together, MOS-Bench and SHEET offer a practical framework for advancing robust, generalizable SSQA methodologies.

Abstract

Subjective speech quality assessment (SSQA) is critical for evaluating speech samples as perceived by human listeners. While model-based SSQA has enjoyed great success thanks to the development of deep neural networks (DNNs), generalization remains a key challenge, especially for unseen, out-of-domain data. To benchmark the generalization abilities of SSQA models, we present MOS-Bench, a diverse collection of datasets. We also introduce SHEET, an open-source toolkit containing complete recipes to conduct SSQA experiments. We provide benchmark results for MOS-Bench and explore multi-dataset training to enhance generalization. Additionally, we propose a new performance metric, best score difference/ratio, and use latent space visualizations to explain model behavior, offering valuable insights for future research.
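
The abstract names the best score difference/ratio metric without defining it. The Figure 3 caption indicates that a difference close to 0 and a ratio close to 100% are the desirable values, which suggests each model is compared against the best score obtained on a test set. The following is a minimal sketch under that assumption, not the authors' implementation: the function name `best_score_metrics`, the nested-dictionary layout, the averaging over test sets, and the example numbers are all hypothetical, and scores are assumed to be positive with higher meaning better (as with a correlation coefficient).

```python
from statistics import mean


def best_score_metrics(scores):
    """Compare each model with the best score seen on each test set.

    scores: {model_name: {test_set_name: score}}, higher is better and
    every best score is assumed to be positive (hypothetical layout).
    Returns, per model, the mean difference to the best score (<= 0)
    and the mean ratio to the best score (<= 1.0).
    """
    # Best score achieved by any model on each test set.
    best = {}
    for per_model in scores.values():
        for test_set, value in per_model.items():
            best[test_set] = max(best.get(test_set, value), value)

    results = {}
    for model, per_model in scores.items():
        diffs = [value - best[t] for t, value in per_model.items()]
        ratios = [value / best[t] for t, value in per_model.items()]
        results[model] = {
            "best_score_difference": mean(diffs),
            "best_score_ratio": mean(ratios),
        }
    return results


# Made-up numbers for illustration only; not results from the paper.
example_scores = {
    "model_a": {"test_x": 0.8, "test_y": 0.5},
    "model_b": {"test_x": 0.6, "test_y": 0.75},
}
results = best_score_metrics(example_scores)
for name, metrics in results.items():
    print(name, metrics)
```

Under this reading, a model that is the best on every test set would score a difference of 0 and a ratio of 100%, matching the color scale described for Figure 3.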

Paper Structure

This paper contains 41 sections, 4 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Main models and inference methods supported in SHEET, the open-source toolkit developed.
  • Figure 2: Distribution plot of an SSL-MOS model trained on NISQA and tested on VMC'23 track 1a.
  • Figure 3: Best score difference and best score ratio results for single-dataset training experiments. For best score difference, the more saturated the color, the closer the score is to 0. For best score ratio, the more saturated the color, the closer the score is to 100%.
  • Figure 4: SSL embedding visualization of SSQA models trained on a single dataset. The dots are colored using set labels. Black dots indicate training samples.
  • Figure 5: SSL embedding visualization of SSQA models trained on a single dataset. The dots are colored using synthetic (orange)/non-synthetic (yellow) labels. Black dots indicate training samples.
  • ...and 2 more figures