XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System

Peiling Lu, Jie Wu, Jian Luan, Xu Tan, Li Zhou

TL;DR

XiaoiceSing tackles high-quality singing voice synthesis by modeling spectrum, F0, and duration in a single FastSpeech-inspired network. The system adds musical-score inputs, a residual connection for F0 prediction, and a syllable-level duration loss, so the three feature streams are optimized jointly and remain mutually consistent before a WORLD vocoder renders them as audio. Compared with a CNN baseline, it improves mean opinion scores for sound quality, pronunciation accuracy, and naturalness, and A/B preference tests favor its F0 and duration modeling. The results suggest that tightly coupled spectrum, F0, and duration modeling with singing-specific design makes singing voice synthesis (SVS) considerably more practical.

Abstract

This paper presents XiaoiceSing, a high-quality singing voice synthesis system that employs an integrated network for spectrum, F0, and duration modeling. We follow the main architecture of FastSpeech while proposing several singing-specific designs: 1) Besides phoneme ID and position encoding, features from the musical score (e.g., note pitch and length) are also added. 2) To attenuate off-key issues, we add a residual connection in F0 prediction. 3) In addition to the duration loss of each phoneme, the durations of all the phonemes in a musical note are accumulated to calculate a syllable duration loss for rhythm enhancement. Experimental results show that XiaoiceSing outperforms the convolutional neural network baseline by 1.44 MOS on sound quality, 1.18 on pronunciation accuracy, and 1.38 on naturalness. In two A/B tests, the proposed F0 and duration modeling methods achieve preference rates of 97.3% and 84.3% over the baseline, respectively, which demonstrates the overwhelming advantages of XiaoiceSing.
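Design points 2 and 3 of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of MIDI note numbers, the log-F0 domain for the residual, and the plain mean squared error are all assumptions made for illustration. The idea shown is that the network only predicts a deviation from the score's note pitch, and that phoneme durations are summed per musical note before a second, syllable-level error is computed.

```python
import math

def midi_to_log_f0(midi_pitch):
    """Convert a MIDI note number to log-F0 (natural log of Hz); A4 = 69 = 440 Hz."""
    return math.log(440.0 * 2.0 ** ((midi_pitch - 69) / 12.0))

def residual_f0(note_pitches, predicted_residuals):
    """Residual connection (assumed form): output log-F0 = score pitch + predicted residual."""
    return [midi_to_log_f0(p) + r for p, r in zip(note_pitches, predicted_residuals)]

def duration_losses(pred, target, note_ids):
    """Return (phoneme_loss, syllable_loss), both as mean squared errors.

    note_ids[i] is a hypothetical index naming the musical note that
    phoneme i belongs to; durations can be in frames or seconds.
    """
    # Phoneme-level loss: compare each phoneme duration directly.
    phoneme_loss = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

    # Syllable-level loss: accumulate durations of all phonemes in a note first.
    pred_syl, target_syl = {}, {}
    for p, t, n in zip(pred, target, note_ids):
        pred_syl[n] = pred_syl.get(n, 0.0) + p
        target_syl[n] = target_syl.get(n, 0.0) + t
    syllable_loss = sum((pred_syl[n] - target_syl[n]) ** 2 for n in pred_syl) / len(pred_syl)
    return phoneme_loss, syllable_loss

# Two phonemes share note 0, one phoneme sings note 1.
ph_loss, syl_loss = duration_losses([3.0, 5.0, 4.0], [4.0, 4.0, 4.0], [0, 0, 1])
print(ph_loss, syl_loss)
```

In this toy input the two phonemes of note 0 are each off by one unit in opposite directions, so the phoneme loss is nonzero while the syllable loss is zero: the note still starts and ends on time. A training objective that combines both terms therefore penalizes rhythm drift separately from phoneme-boundary errors; how the two terms are weighted is not stated in the abstract.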

Paper Structure

This paper contains 12 sections, 3 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Architecture of XiaoiceSing based on modified FastSpeech
  • Figure 2: Musical score representation
  • Figure 3: Averaged global variances (GVs) of mel-generalized cepstrum coefficients
  • Figure 4: (a) Mel-spectrum of samples predicted by the baseline. (b) Mel-spectrum of samples predicted by XiaoiceSing
  • Figure 5: A/B preference test results for F0 and duration.
  • ...and 2 more figures