Pitch-and-Spectrum-Aware Singing Quality Assessment with Bias Correction and Model Fusion

Yu-Fei Shi, Yang Ai, Ye-Xin Lu, Hui-Peng Du, Zhen-Hua Ling

TL;DR

The paper tackles automatic singing quality assessment (MOS) for SVS/SVC outputs by introducing PS-SQA, a pitch- and spectrum-aware enhancement of SSL-based MOS predictors. PS-SQA integrates pitch cues via compressed-pitch and pitch-histogram conditioning and spectral cues via a non-quantized APCodec-based representation, augmented with a bias-correction branch and model fusion to leverage predictor diversity. Trained on SingMOS with the loss $L_1 = \|\hat{y}-y\|_1$, PS-SQA achieves superior system-level SRCC and related metrics, outperforming competing track-2 systems and closely approaching or surpassing the official baseline. This approach demonstrates that explicit melodic and spectral information, together with imbalance-aware training and ensembling, substantially improves automatic singing quality evaluation with practical implications for SVS/SVC deployment.
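As a rough illustration of the quantities named above, the sketch below computes an L1 loss and a system-level SRCC in plain Python. It assumes that "system-level" means utterance scores are first averaged per system and the two sets of system means are then rank-correlated. The function names, the batch-mean form of the L1 loss, and the toy scores are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def system_level_srcc(system_ids, true_mos, pred_mos):
    """Average scores per system, then compute Spearman's rank correlation."""
    by_system = defaultdict(lambda: ([], []))
    for sid, y, y_hat in zip(system_ids, true_mos, pred_mos):
        by_system[sid][0].append(y)
        by_system[sid][1].append(y_hat)
    true_means = [mean(t) for t, _ in by_system.values()]
    pred_means = [mean(p) for _, p in by_system.values()]
    # Spearman's rho equals Pearson's r computed on the ranks.
    return pearson(average_ranks(true_means), average_ranks(pred_means))


def l1_loss(true_mos, pred_mos):
    """Mean absolute error over a batch (the batch mean is an assumption)."""
    return mean(abs(y_hat - y) for y, y_hat in zip(true_mos, pred_mos))


# Toy data: three hypothetical systems with two utterances each.
systems = ["A", "A", "B", "B", "C", "C"]
truth = [4.0, 4.2, 3.0, 3.2, 2.0, 2.2]
preds = [3.9, 4.0, 3.5, 3.4, 2.5, 2.6]
srcc = system_level_srcc(systems, truth, preds)
```

In this toy case the predictor ranks the three systems in the same order as the listeners, so the system-level SRCC is 1 even though the individual predictions are offset from the true scores; rank correlation rewards correct ordering rather than absolute accuracy.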

Abstract

We participated in track 2 of the VoiceMOS Challenge 2024, which aimed to predict the mean opinion score (MOS) of singing samples. Our submission secured first place among all participating teams, excluding the official baseline. In this paper, we further improve our submission and propose a novel Pitch-and-Spectrum-aware Singing Quality Assessment (PS-SQA) method. PS-SQA builds on the self-supervised-learning (SSL) MOS predictor and incorporates singing pitch and spectral information, which are extracted using a pitch histogram and a non-quantized neural codec, respectively. Additionally, PS-SQA introduces a bias correction strategy to address prediction biases caused by low-resource training samples, and employs model fusion to further enhance prediction accuracy. Experimental results show that the proposed PS-SQA significantly outperforms all competing systems across all system-level metrics, confirming its strong singing quality assessment capabilities.
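To make the pitch-histogram idea concrete, here is a minimal sketch that turns a frame-level F0 contour into a normalized histogram over semitone bins. The bin range (MIDI notes 36 to 96), the one-semitone bin width, the convention that unvoiced frames carry an F0 of 0, and the function name are all assumptions made for illustration; the paper's actual extraction settings are not given in this summary.

```python
import math


def pitch_histogram(f0_hz, low_midi=36, high_midi=96):
    """Normalized histogram of voiced frames over one-semitone bins.

    f0_hz: per-frame fundamental frequency in Hz, with 0 (or negative)
    marking unvoiced frames. The bin range is a hypothetical choice.
    """
    n_bins = high_midi - low_midi + 1
    counts = [0] * n_bins
    for f0 in f0_hz:
        if f0 <= 0:
            continue  # skip unvoiced frames
        # Convert Hz to a MIDI note number (A4 = 440 Hz = note 69).
        midi = 69 + 12 * math.log2(f0 / 440.0)
        idx = round(midi) - low_midi
        if 0 <= idx < n_bins:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in counts]


# Toy contour: two frames on A4, one unvoiced, one on A5, one on A3.
hist = pitch_histogram([440.0, 440.0, 0.0, 880.0, 220.0])
```

A contour that stays on a few notes yields a histogram with a few sharp peaks, while a contour that drifts between notes spreads its mass over many adjacent bins, which is one plausible reason such a feature could help a quality predictor.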

Paper Structure

This paper contains 16 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: A block diagram of a plain SSL-based MOS predictor.
  • Figure 2: The pitch histograms of (a) a good singing voice with MOS of 5.0 and (b) a poor singing voice with MOS of 2.4.
  • Figure 3: Architectures of the pitch-aware and spectrum-aware SSL-based MOS predictors.
  • Figure 4: Histograms of the number of samples in different MOS intervals for (a) training set and (b) validation set of SingMOS dataset.
  • Figure 5: The structure of the bias-correction branch.
  • ...and 1 more figure