MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models
Wen-Chin Huang, Erica Cooper, Tomoki Toda
TL;DR
The paper tackles the generalization gap in subjective speech quality assessment (SSQA) by building MOS-Bench, a diverse multi-dataset benchmark, and introducing the SHEET toolkit for reproducible SSQA experiments. It proposes new cross-dataset metrics (best score difference and best score ratio) and demonstrates that training on multiple datasets improves out-of-domain generalization while preserving in-domain accuracy. Through latent-space visualizations, it provides a diagnostic view of how models cover various data distributions and helps explain generalization patterns. An intriguing finding is that models trained on non-synthetic datasets (e.g., NISQA, PSTN) can generalize well to synthetic domains, suggesting a potential shift in data collection strategies. Together, MOS-Bench and SHEET offer a practical framework for advancing robust, generalizable SSQA methodologies.
Abstract
Subjective speech quality assessment (SSQA) is critical for evaluating speech samples as perceived by human listeners. While model-based SSQA has enjoyed great success thanks to the development of deep neural networks (DNNs), generalization remains a key challenge, especially for unseen, out-of-domain data. To benchmark the generalization abilities of SSQA models, we present MOS-Bench, a diverse collection of datasets. We also introduce SHEET, an open-source toolkit containing complete recipes for conducting SSQA experiments. We provide benchmark results for MOS-Bench and explore multi-dataset training to enhance generalization. Additionally, we propose a new performance metric, best score difference/ratio, and use latent space visualizations to explain model behavior, offering valuable insights for future research.
