SingMOS-Pro: An Comprehensive Benchmark for Singing Quality Assessment

Yuxun Tang, Lan Liu, Wenhao Feng, Yiwen Zhao, Jionghao Han, Yifeng Yu, Jiatong Shi, Qin Jin

TL;DR

The paper addresses the difficulty of evaluating singing quality by introducing SingMOS-Pro, a large, multilingual, multi-task mean opinion score (MOS) dataset for singing quality assessment (SQA). It combines the preview SingMOS release with newly collected data to provide 7,981 clips covering singing voice synthesis (SVS), singing voice conversion (SVC), singing voice resynthesis (SVR), and ground-truth samples, annotated for overall quality as well as lyric and melody aspects. The authors benchmark common SQA approaches, analyze training-set strategies such as multi-dataset finetuning and domain identifiers, and report that integrating cross-batch information improves performance, while also identifying limitations of existing speech MOS models on singing data. The work provides strong baselines and practical guidance for future SQA research, supporting more robust automatic evaluation of singing voice generation systems and opening the way to incorporating melodic and lyrical cues into quality assessment.

Abstract

Singing voice generation is progressing rapidly, yet evaluating singing quality remains a critical challenge. Human subjective assessment, typically in the form of listening tests, is costly and time-consuming, while existing objective metrics capture only limited perceptual aspects. In this work, we introduce SingMOS-Pro, a dataset for automatic singing quality assessment. Building on our preview version SingMOS, which provides only overall ratings, SingMOS-Pro annotates the newly added data for lyrics, melody, and overall quality, offering broader coverage and greater diversity. The dataset contains 7,981 singing clips generated by 41 models across 12 datasets, spanning from early systems to recent advances. Each clip receives at least five ratings from professional annotators, ensuring reliability and consistency. Furthermore, we explore how to effectively utilize MOS data annotated under different standards and benchmark several widely used evaluation methods from related tasks on SingMOS-Pro, establishing strong baselines and practical references for future research. The dataset can be accessed at https://huggingface.co/datasets/TangRain/SingMOS-Pro.
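
As a rough illustration of how individual listener ratings become clip-level and system-level MOS values, the following minimal sketch averages ratings per clip and then per system. The record layout, the identifiers, the 1 to 5 rating scale, and the helper names `clip_mos` and `system_mos` are all assumptions made for illustration; they are not the SingMOS-Pro schema or the authors' code. Only the five-rating minimum comes from the abstract.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical rating records: (system_id, clip_id, score).
# The identifiers and the 1-5 scale are illustrative assumptions only.
ratings = [
    ("sys_a", "clip_1", 4), ("sys_a", "clip_1", 5), ("sys_a", "clip_1", 4),
    ("sys_a", "clip_1", 3), ("sys_a", "clip_1", 4),
    ("sys_b", "clip_2", 2), ("sys_b", "clip_2", 3), ("sys_b", "clip_2", 2),
    ("sys_b", "clip_2", 3), ("sys_b", "clip_2", 2),
    ("sys_b", "clip_3", 5), ("sys_b", "clip_3", 4), ("sys_b", "clip_3", 5),
]


def clip_mos(records, min_ratings=5):
    """Average the ratings of each clip, skipping clips with too few ratings."""
    by_clip = defaultdict(list)
    for system, clip, score in records:
        by_clip[(system, clip)].append(score)
    return {
        key: fmean(scores)
        for key, scores in by_clip.items()
        if len(scores) >= min_ratings
    }


def system_mos(clip_scores):
    """Average clip-level MOS values within each system."""
    by_system = defaultdict(list)
    for (system, _clip), mos in clip_scores.items():
        by_system[system].append(mos)
    return {system: fmean(values) for system, values in by_system.items()}


clip_scores = clip_mos(ratings)
system_scores = system_mos(clip_scores)

for (system, clip), mos in sorted(clip_scores.items()):
    print(f"{system}/{clip}: MOS = {mos:.2f}")
for system, mos in sorted(system_scores.items()):
    print(f"{system}: system MOS = {mos:.2f}")
```

In this toy input, `clip_3` has only three ratings, so it is excluded from both the clip-level and system-level averages.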

Paper Structure

This paper contains 13 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Distribution of systems and utterances in SingMOS-Pro. Subfigure (a) illustrates the distribution of systems across different tasks in the dataset, while subfigure (b) shows the distribution of utterances within each task.
  • Figure 2: Distribution of Utterances Across MOS Intervals.
  • Figure 3: Distribution of Systems Across MOS Intervals.
  • Figure 4: Utterance Composition Statistics (a few extreme outliers are discarded in the visualization). Subfigure (a) presents the number of utterances in each system, and subfigure (b) depicts the distribution of utterance durations.
  • Figure 5: Distribution of Subsets Across MOS Intervals.