Pitch-and-Spectrum-Aware Singing Quality Assessment with Bias Correction and Model Fusion
Yu-Fei Shi, Yang Ai, Ye-Xin Lu, Hui-Peng Du, Zhen-Hua Ling
TL;DR
The paper tackles automatic singing quality assessment, that is, predicting the mean opinion score (MOS) of singing voice synthesis (SVS) and singing voice conversion (SVC) outputs. It introduces PS-SQA, a pitch- and spectrum-aware enhancement of SSL-based MOS predictors. PS-SQA integrates pitch cues via compressed-pitch and pitch-histogram conditioning and spectral cues via a non-quantized APCodec-based representation, and adds a bias-correction branch and model fusion to exploit predictor diversity. Trained on SingMOS with the loss $L_1 = \|\hat{y}-y\|_1$, where $\hat{y}$ is the predicted MOS and $y$ the reference MOS, PS-SQA improves system-level SRCC and related metrics, outperforming competing track-2 systems and closely approaching or surpassing the official baseline. The results indicate that explicit melodic and spectral information, together with imbalance-aware training and ensembling, substantially improves automatic singing quality evaluation, with practical implications for SVS/SVC deployment.
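To make the training loss and the headline metric concrete, the following is a minimal sketch, not the authors' code. It assumes that system-level SRCC means the Spearman rank correlation between per-system averages of predicted and reference MOS, which is the usual convention in MOS prediction challenges. The function names (`l1_loss`, `average_ranks`, `system_level_srcc`) and the toy data are hypothetical.

```python
from collections import defaultdict


def l1_loss(preds, targets):
    """Sum of absolute errors, matching L_1 = ||y_hat - y||_1."""
    return sum(abs(p - t) for p, t in zip(preds, targets))


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def system_level_srcc(system_ids, preds, targets):
    """Average scores per system, then rank-correlate the averages."""
    by_system = defaultdict(lambda: ([], []))
    for s, p, t in zip(system_ids, preds, targets):
        by_system[s][0].append(p)
        by_system[s][1].append(t)
    mean_pred = [sum(v[0]) / len(v[0]) for v in by_system.values()]
    mean_true = [sum(v[1]) / len(v[1]) for v in by_system.values()]
    # Spearman correlation is Pearson correlation computed on ranks.
    return pearson(average_ranks(mean_pred), average_ranks(mean_true))


# Toy data (hypothetical): three systems, two utterances each.
systems = ["A", "A", "B", "B", "C", "C"]
pred_mos = [3.0, 3.2, 4.0, 4.2, 2.0, 2.5]
true_mos = [3.5, 3.1, 4.5, 4.0, 2.2, 2.0]
srcc = system_level_srcc(systems, pred_mos, true_mos)
```

Because the metric depends only on the ordering of system averages, a predictor with a constant offset can still score well at the system level while having a large utterance-level $L_1$ error.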
Abstract
We participated in track 2 of the VoiceMOS Challenge 2024, which aimed to predict the mean opinion score (MOS) of singing samples. Our submission secured first place among all participating teams, excluding the official baseline. In this paper, we further improve our submission and propose a novel Pitch-and-Spectrum-aware Singing Quality Assessment (PS-SQA) method. PS-SQA builds on a self-supervised-learning (SSL) MOS predictor and incorporates singing pitch and spectral information, which are extracted using a pitch histogram and a non-quantized neural codec, respectively. Additionally, PS-SQA introduces a bias correction strategy to address prediction biases caused by low-resource training samples, and employs model fusion to further enhance prediction accuracy. Experimental results show that the proposed PS-SQA significantly outperforms all competing systems across all system-level metrics, confirming its strong singing quality assessment capability.
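As an illustration of the pitch-histogram idea mentioned in the abstract, here is a minimal sketch under stated assumptions. It assumes a frame-level F0 contour in Hz with unvoiced frames marked as 0, and bins voiced frames by the nearest MIDI semitone. The binning scheme, the note range (`low_midi`, `high_midi`), and the function name `pitch_histogram` are assumptions for illustration and may differ from the paper's actual feature.

```python
import math


def pitch_histogram(f0_hz, low_midi=36, high_midi=84):
    """Normalized histogram of voiced frames over semitone bins.

    f0_hz: per-frame fundamental frequency in Hz; values <= 0 are unvoiced.
    low_midi, high_midi: inclusive MIDI note range (assumed, not from the paper).
    """
    n_bins = high_midi - low_midi + 1
    counts = [0] * n_bins
    for f0 in f0_hz:
        if f0 <= 0:
            continue  # skip unvoiced frames
        # Standard Hz-to-MIDI conversion with A4 = 440 Hz = MIDI note 69.
        midi = 69 + 12 * math.log2(f0 / 440.0)
        idx = round(midi) - low_midi
        if 0 <= idx < n_bins:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in counts]


# Toy contour (hypothetical): two frames at A4, one at A3, two unvoiced.
hist = pitch_histogram([0.0, 440.0, 440.0, 220.0, 0.0])
```

A histogram of this kind discards timing and keeps only the distribution of sung pitches, so it yields a fixed-length vector regardless of clip duration, which makes it straightforward to feed into a predictor alongside SSL features.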
