BiSinger: Bilingual Singing Voice Synthesis

Huali Zhou, Yueqian Lin, Yao Shi, Peng Sun, Ming Li

TL;DR

BiSinger addresses bilingual SVS for English and Mandarin with a language-independent phoneme representation built on the CMU dictionary, plus dataset-adaptation strategies that fuse monolingual singing data with cross-language priors. It adds a language-style-infused encoder, which uses learnable language embeddings and an Embedding Strength Modulator to capture dynamic language-phonology interactions, and it expands data coverage through timbre conversion and pitch-shift pseudo-singing. Objective metrics, WER, and human MOS evaluations show improved English and code-switch SVS with Chinese singing quality maintained, and ablations support the contribution of each component. The result is a single model that synthesizes singing in both languages, along with practical pathways for leveraging bilingual speech data in SVS.

Abstract

Although Singing Voice Synthesis (SVS) has made great strides with Text-to-Speech (TTS) techniques, multilingual singing voice modeling remains relatively unexplored. This paper presents BiSinger, a bilingual pop SVS system for English and Chinese Mandarin. Current systems require separate models per language and cannot accurately represent both Chinese and English, hindering code-switch SVS. To address this gap, we design a shared representation between Chinese and English singing voices, achieved by using the CMU dictionary with mapping rules. We fuse monolingual singing datasets with open-source singing voice conversion techniques to generate bilingual singing voices while also exploring the potential use of bilingual speech data. Experiments affirm that our language-independent representation and incorporation of related datasets enable a single model with enhanced performance in English and code-switch SVS while maintaining Chinese song performance. Audio samples are available at https://bisinger-svs.github.io.
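
As a rough illustration of what a shared phoneme representation can look like, the sketch below converts a code-switched lyric into a single ARPAbet-style phoneme sequence. It is a minimal sketch under stated assumptions, not the paper's method: the function name `to_shared_phonemes`, the toy English lexicon, and especially the pinyin-to-ARPAbet table are hypothetical stand-ins, since the actual mapping rules are not given in this section.

```python
# Toy CMUdict-style entries (ARPAbet symbols, stress markers dropped).
ENGLISH_LEXICON = {
    "love": ["L", "AH", "V"],
    "you": ["Y", "UW"],
}

# HYPOTHETICAL pinyin-to-ARPAbet rules, for illustration only.
# The paper's actual mapping rules are not reproduced here.
PINYIN_TO_ARPABET = {
    "wo": ["W", "AO"],
    "ai": ["AY"],
    "ni": ["N", "IY"],
}


def to_shared_phonemes(tokens):
    """Map (language, token) pairs to one shared phoneme sequence."""
    phonemes = []
    for lang, token in tokens:
        # Both languages resolve to the same phoneme inventory.
        table = ENGLISH_LEXICON if lang == "en" else PINYIN_TO_ARPABET
        key = token.lower()
        if key not in table:
            raise KeyError(f"no entry for {token!r} ({lang})")
        phonemes.extend(table[key])
    return phonemes


# Code-switched lyric: two Mandarin syllables followed by an English word.
lyric = [("zh", "wo"), ("zh", "ai"), ("en", "you")]
print(to_shared_phonemes(lyric))  # ['W', 'AO', 'AY', 'Y', 'UW']
```

Because both tables emit symbols from one inventory, a downstream acoustic model would see a single phoneme vocabulary regardless of the input language, which is the property a code-switch system needs.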

Paper Structure

This paper contains 19 sections, 3 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: System overview.
  • Figure 2: Language-style-infused encoder.
  • Figure 3: WER results.
  • Figure 4: Preference results in terms of pronunciation.
  • Figure 5: Substitution example.