CORN: Co-Trained Full- And No-Reference Speech Quality Assessment

Pranay Manocha; Donald Williamson; Adam Finkelstein

TL;DR

We address robust speech quality assessment without heavy human labeling by proposing CORN, a framework that co-trains full-reference (FR) and no-reference (NR) models on a shared base and jointly predicts FR scores $f_{ij}$ and NR scores $n_i$ using losses tied to SI-SDR and related targets. The results indicate that including the NR loss during training stabilizes the FR predictor and that NR models benefit from reference information during training, yielding two independently useful predictors that outperform independently trained baselines. Embedding analyses show improved content-invariance and retrieval performance, indicating that the shared representation captures quality-relevant structure beyond content. This framework reduces reliance on human ratings while delivering robust, reference-flexible speech quality evaluation suitable for diverse deployment scenarios.

Abstract

Perceptual evaluation is a crucial aspect of many audio-processing tasks. Full-reference (FR) or similarity-based metrics rely on high-quality reference recordings, against which lower-quality or corrupted versions of the recording are compared. In contrast, no-reference (NR) metrics evaluate a recording without relying on a reference. Each approach has advantages and drawbacks relative to the other. In this paper, we present a novel framework called CORN that combines the two approaches by training FR and NR models concurrently. After training, the models can be applied independently. We evaluate CORN by predicting several common objective metrics across two different architectures. The NR model trained using CORN has access to a reference recording during training, and thus, as one would expect, it consistently outperforms baseline NR models trained independently. Perhaps even more remarkable is that the CORN FR model also outperforms its baseline counterpart, even though it relies on the same training data and the same model architecture. Thus, a single training regime produces two independently useful models, each outperforming independently trained models.
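
As an illustration of the kind of objective metric the models are trained to predict, the sketch below computes SI-SDR (named in the TL;DR) using its standard definition: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. This is a minimal pure-Python sketch, not the authors' code; the function name `si_sdr`, the `eps` stabilizer, and the toy signals are assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB (standard definition).

    `estimate` and `reference` are equal-length sequences of samples.
    `eps` is an illustrative stabilizer that avoids division by zero.
    """
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    # Optimal scaling of the reference toward the estimate.
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Toy signals (hypothetical): a clean reference and two degraded versions.
reference = [1.0, -1.0, 0.5, 0.25]
mild = [1.1, -0.9, 0.4, 0.3]
severe = [1.5, -0.5, 0.0, 0.8]

mild_score = si_sdr(mild, reference)
severe_score = si_sdr(severe, reference)
rescaled_score = si_sdr([2.0 * v for v in mild], reference)
```

Here `mild_score` exceeds `severe_score`, and rescaling the estimate leaves the score essentially unchanged, which is the scale invariance the metric is named for. An FR model sees both signals when predicting such a score, whereas an NR model must predict it from the degraded signal alone.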

Paper Structure (16 sections, 2 equations, 1 figure, 2 tables)

Introduction
Related Work
Full-reference metrics
No-reference metrics
The CORN Framework
Framework Design and Model Architectures
Training Tasks and Loss Functions
Training procedure
Inference
Experimental Setup
Datasets and training
Baselines
Results
Performance across metrics and architectures
Evaluation of the embedding
...and 1 more sections

Figures (1)

Figure 1: Proposed CORN training framework with (a) full-reference (FR, in green) and (b) no-reference (NR, in red) models. Models (a) and (b) are co-trained: the network architecture (c) of the base model $\mathbf{B}$ is identical in every instance across the FR and NR models, with shared weights indicated by the dotted lines. In (a) and (b), task-specific output heads $\mathbf{H}_f$ and $\mathbf{H}_n$ predict the FR and NR scores $f_{ij}$ and $n_i$. In FR, the embedding $e_i$ of recording $x_i$ is identical to its counterpart in NR; however, only in FR is it concatenated with the embedding $e_j$ of a reference recording $r_j$ before being passed to the output head (Sections 3.1 and 3.2).
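
The data flow in the caption can be sketched with toy linear maps standing in for the real networks. This is only an illustration of the wiring (one shared base $\mathbf{B}$, two heads, concatenation in the FR path only). The function names, the linear layers, the squared-error loss, and the unweighted sum of the two loss terms are all assumptions, not details taken from the paper.

```python
def base(x, w_base):
    """Toy stand-in for the shared base B: one embedding value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in w_base]


def head(features, v):
    """Toy stand-in for an output head: a single linear unit."""
    return sum(w * f for w, f in zip(v, features))


def corn_forward(x_i, r_j, w_base, v_fr, v_nr):
    """Return (f_ij, n_i) for recording x_i and reference r_j."""
    e_i = base(x_i, w_base)        # embedding of the recording
    e_j = base(r_j, w_base)        # same weights: the base is shared
    n_i = head(e_i, v_nr)          # NR head H_n sees only e_i
    f_ij = head(e_i + e_j, v_fr)   # FR head H_f sees e_i concatenated with e_j
    return f_ij, n_i


def co_training_loss(f_ij, n_i, fr_target, nr_target):
    """Assumed joint objective: squared FR error plus squared NR error."""
    return (f_ij - fr_target) ** 2 + (n_i - nr_target) ** 2


# Hypothetical toy inputs and weights.
w_base = [[1.0, 0.0], [0.0, 1.0]]
x_i, r_j = [1.0, 2.0], [3.0, 4.0]
f_ij, n_i = corn_forward(x_i, r_j, w_base,
                         v_fr=[1.0, 1.0, 1.0, 1.0], v_nr=[1.0, 1.0])
loss = co_training_loss(f_ij, n_i, fr_target=10.0, nr_target=3.0)
```

Because both loss terms backpropagate through the same base weights, gradients from the NR task shape the embedding used by the FR head and vice versa. At inference time either head can be used on its own, with the NR path requiring no reference.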
