
SingMOS: An extensive Open-Source Singing Voice Dataset for MOS Prediction

Yuxun Tang, Jiatong Shi, Yuning Wu, Qin Jin

TL;DR

This work tackles the lack of publicly available MOS datasets for the singing domain by introducing SingMOS, a large open-access dataset of 3,421 Chinese and Japanese vocal clips (16 kHz, 4.25 hours). It combines ground-truth vocals from open-source datasets with synthetic vocals generated by 21 singing voice synthesis (SVS), 11 singing voice conversion (SVC), and 6 vocoder systems. The dataset carries approximately 15,000 subjective MOS ratings and includes an unseen subset for generalization studies. A baseline singing MOS predictor is built by fine-tuning self-supervised learning (SSL) models, using mean-pooling and a linear head trained with an $L_1$ loss, and is evaluated with mean squared error (MSE), linear correlation coefficient (LCC), Spearman rank correlation coefficient (SRCC), and Kendall's tau (KTAU). Experiments show that wav2vec2.0-base performs best on the main test set, while HuBERT-base performs well on unseen data. Adding F0-related features does not improve performance, likely because of the dataset size or because the SSL models already capture pitch cues. SingMOS thus provides a benchmark for singing MOS prediction and points to directions for dataset expansion and model improvement. The dataset is to be released under CC BY-NC-SA 4.0.
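The baseline head described above (mean-pooling over frame-level SSL features, a linear layer, and an $L_1$ loss) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the frame features, weights, and bias below are toy values chosen for illustration, and a real system would run an SSL encoder and train the head with a deep learning framework.

```python
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level features over time into one utterance vector."""
    n_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]


def linear_head(pooled: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Map the pooled utterance vector to a single scalar MOS estimate."""
    return sum(w * x for w, x in zip(weights, pooled)) + bias


def l1_loss(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Mean absolute error between predicted and annotated MOS values."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


# Toy stand-in for SSL encoder output: 3 frames, 2 feature dimensions (assumed values).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(frames)

# Hypothetical head parameters; in practice these are learned during fine-tuning.
pred = linear_head(pooled, weights=[0.5, 0.25], bias=1.0)

# Compare against a hypothetical annotated MOS of 4.0.
loss = l1_loss([pred], [4.0])
print(pooled, pred, loss)
```

Mean-pooling makes the head independent of clip length, which matters here because clip durations in the dataset vary.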

Abstract

In speech generation tasks, human subjective ratings, usually referred to as the opinion score, are considered the "gold standard" for speech quality evaluation, with the mean opinion score (MOS) serving as the primary evaluation metric. Due to the high cost of human annotation, several MOS prediction systems have emerged in the speech domain, demonstrating good performance. These MOS prediction models are trained using annotations from previous speech-related challenges. However, compared to the speech domain, the singing domain faces data scarcity and stricter copyright protections, leading to a lack of high-quality MOS-annotated datasets for singing. To address this, we propose SingMOS, a high-quality and diverse MOS dataset for singing, covering a range of Chinese and Japanese datasets. These synthesized vocals are generated using state-of-the-art models in singing synthesis, conversion, or resynthesis tasks and are rated by professional annotators alongside real vocals. Data analysis demonstrates the diversity and reliability of our dataset. Additionally, we conduct further exploration on SingMOS, providing insights for singing MOS prediction and guidance for the continued expansion of SingMOS.

Paper Structure (12 sections, 3 figures, 4 tables)

This paper contains 12 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Histograms of the SingMOS dataset. The subfigures show, respectively, the duration distribution of the whole dataset, the duration distribution across the train/development/test splits, the overall system-level MOS distribution, and the utterance-level MOS distribution.
  • Figure 2: Overview of the distribution of source datasets and models over the whole data and the test sets of the SingMOS dataset.
  • Figure 3: Scatter plot of system-level prediction results for each SSL model. * indicates large models with over 300M parameters.
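
The system-level comparison in Figure 3 and the four metrics named in the TL;DR (MSE, LCC, SRCC, KTAU) can be sketched as follows. This is a hedged illustration, not the paper's evaluation code: it assumes a system-level score is the mean of that system's utterance-level scores, it uses the uncorrected tau-a variant of Kendall's tau, and the system names and scores are made-up toy data.

```python
from collections import defaultdict
from math import sqrt


def system_means(system_ids, scores):
    """Average utterance-level scores per system (assumed aggregation rule)."""
    buckets = defaultdict(list)
    for sid, score in zip(system_ids, scores):
        buckets[sid].append(score)
    return {sid: sum(v) / len(v) for sid, v in buckets.items()}


def mse(x, y):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def pearson(x, y):
    """Linear correlation coefficient (LCC)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs, no tie correction."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy utterance-level data for three hypothetical systems.
system_ids = ["sysA", "sysA", "sysB", "sysB", "sysC", "sysC"]
true_mos = [4.0, 4.5, 3.0, 3.5, 2.0, 2.5]
pred_mos = [4.0, 4.0, 2.5, 3.0, 3.0, 3.0]

true_sys = system_means(system_ids, true_mos)
pred_sys = system_means(system_ids, pred_mos)
systems = sorted(true_sys)
t = [true_sys[s] for s in systems]
p = [pred_sys[s] for s in systems]

results = {"MSE": mse(t, p), "LCC": pearson(t, p),
           "SRCC": spearman(t, p), "KTAU": kendall_tau(t, p)}
print(results)
```

The same four functions apply unchanged at the utterance level by passing the per-clip lists directly instead of the per-system means.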