SCOREQ: Speech Quality Assessment with Contrastive Regression
Alessandro Ragano, Jan Skoglund, Andrew Hines
TL;DR
SCOREQ introduces a contrastive regression loss to address generalisation gaps in no-reference speech quality prediction, reframing MOS as a continuous target within a batch-all triplet framework. By replacing offline NSIM-based supervision with MOS-aware triplets and an adaptive margin, SCOREQ learns an ordered quality manifold that generalises across diverse domains and degradation types. The method supports two deployment modes, no-reference (NR) and non-matching reference (NMR), and demonstrates consistent improvements over L2 baselines across 11 test sets and two architectures, including in speech synthesis scenarios. The work provides robust statistical validation and offers ready-to-use NR and NMR metrics, with broader implications for regression-based predictive modelling beyond speech quality.
Abstract
In this paper, we present SCOREQ, a novel approach for speech quality prediction. SCOREQ is a triplet loss function for contrastive regression that addresses the domain generalisation shortcoming exhibited by state-of-the-art no-reference speech quality metrics. In the paper we: (i) illustrate the problem of L2 loss training failing to capture the continuous nature of the mean opinion score (MOS) labels; (ii) demonstrate the lack of generalisation through a benchmarking evaluation across several speech domains; (iii) outline our approach and explore the impact of the architectural design decisions through incremental evaluation; (iv) evaluate the final model against state-of-the-art models for a wide variety of data and domains. The results show that SCOREQ addresses the lack of generalisation observed in state-of-the-art speech quality metrics. We conclude that using a triplet loss function for contrastive regression not only improves generalisation for speech quality prediction models but also has potential utility across a wide range of applications that use regression-based predictive models.
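To make the idea of a triplet loss for contrastive regression more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that, for each anchor, a sample whose MOS is closer to the anchor's acts as the positive and one whose MOS is further away acts as the negative, and that the margin grows with the difference between the two MOS gaps. The function name `batch_all_triplet_regression_loss`, the `base_margin` parameter, the Euclidean distance, and the exact margin formula are all illustrative assumptions.

```python
import numpy as np

def batch_all_triplet_regression_loss(emb, mos, base_margin=0.0):
    """Illustrative batch-all triplet loss driven by continuous MOS labels.

    Hypothetical sketch: the triplet rule and margin below are assumptions,
    not taken from the paper.
    """
    emb = np.asarray(emb, dtype=float)   # (B, D) embeddings
    mos = np.asarray(mos, dtype=float)   # (B,) quality labels
    batch = emb.shape[0]

    # Pairwise Euclidean distances between embeddings, shape (B, B).
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Pairwise absolute MOS differences, shape (B, B).
    gap = np.abs(mos[:, None] - mos[None, :])

    # Index convention for all (B, B, B) tensors: [anchor, positive, negative].
    gap_ap = gap[:, :, None]
    gap_an = gap[:, None, :]

    # A triplet is valid when the positive is strictly closer in MOS to the
    # anchor than the negative is, and the positive is not the anchor itself.
    not_anchor = ~np.eye(batch, dtype=bool)[:, :, None]
    valid = (gap_ap < gap_an) & not_anchor

    # Assumed adaptive margin: larger when the two MOS gaps differ more.
    margin = base_margin + (gap_an - gap_ap)

    # Hinge on d(a, p) - d(a, n) + margin, averaged over all valid triplets.
    hinge = np.maximum(dist[:, :, None] - dist[:, None, :] + margin, 0.0)
    losses = hinge[valid]
    return float(losses.mean()) if losses.size else 0.0

# Embeddings laid out in the same order and spacing as the labels.
ordered = batch_all_triplet_regression_loss([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])
# Same labels, but the last two embeddings are swapped.
shuffled = batch_all_triplet_regression_loss([[0.0], [2.0], [1.0]], [0.0, 1.0, 2.0])
```

In this toy case the loss vanishes when the embedding distances mirror the label gaps and is positive when the ordering is broken, which is the behaviour an ordered quality manifold would require. An L2 loss on a scalar output places no such constraint on the embedding space.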
