Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing
Jiatong Shi, Yueqian Lin, Xinyi Bai, Keyi Zhang, Yuning Wu, Yuxun Tang, Yifeng Yu, Qin Jin, Shinji Watanabe
TL;DR
Data scarcity is a major bottleneck in singing voice synthesis (SVS). This work presents a data curation strategy that combines an existing singing voice synthesizer with manual tuning to produce two large, multi-singer corpora, ACE-Opencpop and ACE-KiSing, which serve as new SVS benchmarks and as augmentation resources. Through direct SVS, transfer learning, and joint training experiments using models such as XiaoiceSing and VISinger2, the authors demonstrate that these corpora can improve both objective metrics and perceptual quality (MOS), with end-to-end approaches yielding the strongest performance. The datasets, pre-trained models, and training recipes are released under CC-BY-NC-4.0 at ESPnet-Muskits, enabling broad adoption and further research in large-scale SVS and cross-dataset transfer learning.
Abstract
In singing voice synthesis (SVS), generating singing voices from musical scores is challenging because training data is limited. This study proposes a strategy to address data scarcity in SVS. We employ an existing singing voice synthesizer for data augmentation, complemented by detailed manual tuning, an approach not previously explored in data curation, to reduce instances of unnatural voice synthesis. This method has led to the creation of two large singing voice datasets, ACE-Opencpop and ACE-KiSing, which support large-scale, multi-singer voice synthesis. Through thorough experimentation, we establish that these datasets not only serve as new benchmarks for SVS but also enhance SVS performance on other singing voice datasets when used as supplementary resources. The corpora, pre-trained models, and their related training recipes are publicly available at ESPnet-Muskits (https://github.com/espnet/espnet).
