SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis

Jiale Qian, Hao Meng, Tian Zheng, Pengcheng Zhu, Haopeng Lin, Yuhang Dai, Hanke Xie, Wenxiao Cao, Ruixuan Shang, Jun Wu, Hongmei Liu, Hanlin Wen, Jian Zhao, Zhonglin Jiang, Yong Chen, Shunshun Yin, Ming Tao, Jianguo Wei, Lei Xie, Xinsheng Wang

TL;DR

The paper tackles zero-shot, multilingual singing voice synthesis (SVS), focusing on robustness and practical deployment. It introduces SoulX-Singer, a large-scale, non-autoregressive SVS system that uses a flow-matching decoder with a Diffusion Transformer backbone and a Singing Content Encoder. The model is trained on more than 42,000 hours of data across Mandarin, English, and Cantonese, and supports both MIDI score and melody conditioning. A dedicated data processing pipeline and two benchmarks, GMO-SVS and SoulX-Singer-Eval, enable rigorous, train-test disentangled evaluation of zero-shot performance. Results show state-of-the-art performance across languages on metrics such as FFE, WER, SIM, SingMOS, and Sheet, with strong cross-lingual identity preservation. The work provides a practical framework for production-grade, zero-shot SVS and a basis for future research in expressive multilingual singing synthesis.

Abstract

While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment considerations in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese and consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement, facilitating systematic assessment in zero-shot settings.

Paper Structure

This paper contains 19 sections, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Performance of SoulX-Singer.
  • Figure 2: Pipeline for large-scale singing data curation: from raw audio extraction to time-aligned MIDI and text formulation.
  • Figure 3: Overview of SoulX-Singer.