SCOREQ: Speech Quality Assessment with Contrastive Regression

Alessandro Ragano, Jan Skoglund, Andrew Hines

TL;DR

SCOREQ introduces a contrastive regression loss to address generalisation gaps in no-reference speech quality prediction, reframing MOS as a continuous target within a batch-all triplet framework. By replacing offline NSIM-based supervision with MOS-aware triplets and an adaptive margin, SCOREQ learns an ordered quality manifold that generalises across diverse domains and degradation types. The method supports two deployment modes—no-reference (NR) and non-matching reference (NMR)—and demonstrates consistent improvements over L2 baselines across 11 test sets and two architectures, including in speech synthesis scenarios. The work provides robust statistical validation and offers ready-to-use NR and NMR metrics, with broader implications for regression-based predictive modelling beyond speech quality.

Abstract

In this paper, we present SCOREQ, a novel approach for speech quality prediction. SCOREQ is a triplet loss function for contrastive regression that addresses the domain generalisation shortcoming exhibited by state-of-the-art no-reference speech quality metrics. In the paper we: (i) illustrate how training with the L2 loss fails to capture the continuous nature of the mean opinion score (MOS) labels; (ii) demonstrate the lack of generalisation through a benchmarking evaluation across several speech domains; (iii) outline our approach and explore the impact of the architectural design decisions through incremental evaluation; (iv) evaluate the final model against state-of-the-art models for a wide variety of data and domains. The results show that the lack of generalisation observed in state-of-the-art speech quality metrics is addressed by SCOREQ. We conclude that using a triplet loss function for contrastive regression not only improves generalisation for speech quality prediction models but also has potential utility across a wide range of applications using regression-based predictive models.

Paper Structure

This paper contains 53 sections, 7 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: Embeddings of L2 loss (a) vs SCOREQ (b) on TCD VOIP data harte2015tcd. Color shows quality labels (MOS) while markers identify the degradations. We compute the Normalised Mutual Information (NMI) between k-Means clusters and degradation labels, as well as the Pearson's Correlation (PC) between MOS targets and the embedding distance to random clean speech. Higher NMI indicates that representations are clustered by degradation, while higher PC means that representations are ordered with respect to MOS targets. Results indicate that the L2 loss embeddings tend to capture degradation information (NMI=0.39, PC=0.53), while the SCOREQ embeddings capture quality (NMI=0.11, PC=0.80). See Appendix \ref{appendix:figure1} for more details.
  • Figure 2: Example of the SCOREQ loss using 3 samples in a training batch with corresponding MOS labels 4.5, 2.0, and 1.5. The distance matrix entries are defined as $D_{i,j,k} = ||f(g(\bm{x}_i))-f(g(\bm{x}_j))||_2 < ||f(g(\bm{x}_i)) - f(g(\bm{x}_k))||_2$. The intuition behind this contrastive loss for regression is shown by how the negative embeddings change for anchor sample 1, where MOS is 4.5. The negative (sample 3, with MOS 1.5) ends up further from the anchor than sample 2. This is because the loss for anchor 2 (where MOS is 2.0) pushes the sample 3 embeddings towards sample 2.
  • Figure 3: Example of how the mask $M(i,j,k)$ assigns 0 or 1. If the distance between the anchor and the positive is lower than the distance between the anchor and the negative, we consider it a valid triplet. The index condition must also hold, i.e., $i \neq j \neq k$ (the three indices are distinct).
  • Figure 4: SCOREQ modes. No-Reference (NR) mode is trained in 2 steps. We first pre-train the encoder $g(\cdot)$ with the SCOREQ loss. Next, we learn a linear layer (MOS head) that predicts an interpretable numerical MOS.
  • Figure 5: Domain mismatch. Each dot is a dataset, while horizontal lines represent the PC average for each domain shift (IN, ODS, ODM). PC values are the same as in Table \ref{tab:mismatch}.
  • ...and 1 more figure
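
The captions of Figures 2 and 3 describe a mask over (anchor, positive, negative) triplets and a loss over embedding distances. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. It makes three assumptions that the text above does not confirm: the mask compares MOS gaps (the positive is the sample whose MOS is closer to the anchor's), the adaptive margin is the difference of the two MOS gaps divided by a hypothetical `mos_range` of 4.0, and the hinge loss is averaged over all valid triplets. The names `valid_triplets` and `batch_all_triplet_loss` are illustrative, and `emb[i]` stands in for $f(g(\bm{x}_i))$.

```python
import math
from itertools import permutations


def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def valid_triplets(mos):
    """Return the (anchor, positive, negative) triplets where the mask is 1.

    ASSUMPTION: a triplet is valid when its indices are distinct and the
    positive is closer to the anchor in MOS than the negative is.
    """
    return [
        (i, j, k)
        for i, j, k in permutations(range(len(mos)), 3)  # distinct indices
        if abs(mos[i] - mos[j]) < abs(mos[i] - mos[k])
    ]


def batch_all_triplet_loss(emb, mos, mos_range=4.0):
    """Mean hinge loss over all valid triplets in the batch.

    ASSUMPTION: the margin grows with the difference between the two MOS
    gaps, scaled by mos_range (a hypothetical normalisation constant).
    """
    triplets = valid_triplets(mos)
    if not triplets:
        return 0.0
    total = 0.0
    for i, j, k in triplets:
        margin = (abs(mos[i] - mos[k]) - abs(mos[i] - mos[j])) / mos_range
        # Penalise triplets where the positive is not closer than the
        # negative by at least the margin.
        total += max(l2(emb[i], emb[j]) - l2(emb[i], emb[k]) + margin, 0.0)
    return total / len(triplets)


# The three-sample batch from Figure 2.
mos = [4.5, 2.0, 1.5]
ordered = [[0.0, 0.0], [3.0, 0.0], [4.0, 0.0]]     # embeddings follow MOS order
misordered = [[0.0, 0.0], [4.0, 0.0], [3.0, 0.0]]  # samples 2 and 3 swapped

print(valid_triplets(mos))
print(batch_all_triplet_loss(ordered, mos))
print(batch_all_triplet_loss(misordered, mos))
```

With the Figure 2 labels, the mask keeps three triplets: anchor 1 with sample 2 as positive and sample 3 as negative, anchor 2 with sample 3 as positive, and anchor 3 with sample 2 as positive. Under these assumptions the loss is zero when the embeddings are already ordered by MOS and positive when samples 2 and 3 are swapped.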