VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation

Yifeng Yu; Jiatong Shi; Yuning Wu; Yuxun Tang; Shinji Watanabe

TL;DR

The paper tackles data scarcity in singing voice synthesis by leveraging unlabeled audio via self-supervised learning representations within the VISinger2 framework. It introduces VISinger2+, which fuses multi-layer SSL features with Mel-spectrograms to form an enriched embedding input $e$ for the posterior encoder, enabling end-to-end training with KL and adversarial losses. Evaluations on Mandarin and Japanese corpora (Opencpop, Ofuton-P, ACE-Opencpop) show consistent gains in objective metrics like MCD and ST Acc, as well as perceptual MOS, with different SSL models (HuBERT, MERT, CN-HuBERT) offering dataset-specific advantages. The approach enhances spectral fidelity and singer identity in multi-singer SVS and provides an open-source implementation for reproducibility.
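The fusion step described above can be illustrated with a minimal sketch. The summary only states that multi-layer SSL features are fused with Mel-spectrograms to form the embedding $e$; the softmax-weighted layer sum, the nearest-neighbour frame alignment, the concatenation, and every name below (`fuse_ssl_with_mel`, `layer_logits`) are assumptions made for illustration and may differ from the authors' implementation.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def fuse_ssl_with_mel(ssl_layers, mel, layer_logits):
    """Build an enriched posterior-encoder input e (hypothetical helper).

    ssl_layers:   (L, T_ssl, D_ssl) hidden states from L layers of an SSL model
    mel:          (T_mel, D_mel) Mel-spectrogram frames
    layer_logits: (L,) per-layer scores (learnable in a real model; assumed here)
    returns:      (T_mel, D_mel + D_ssl) fused embedding
    """
    # Assumption: layers are combined by a softmax-weighted sum.
    weights = softmax(np.asarray(layer_logits, dtype=np.float64))
    ssl = np.tensordot(weights, ssl_layers, axes=1)  # (T_ssl, D_ssl)

    # Assumption: SSL frames are aligned to Mel frames by nearest-neighbour
    # index mapping, since the two feature streams may use different hop sizes.
    t_mel, t_ssl = mel.shape[0], ssl.shape[0]
    idx = np.minimum((np.arange(t_mel) * t_ssl) // t_mel, t_ssl - 1)
    ssl_aligned = ssl[idx]  # (T_mel, D_ssl)

    # Assumption: fusion is a simple concatenation along the feature axis.
    return np.concatenate([mel, ssl_aligned], axis=-1)
```

Concatenation keeps the Mel-spectrogram untouched and lets the posterior encoder decide how much to rely on the SSL stream; an additive fusion after a linear projection would be an equally plausible reading of the summary.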

Abstract

Singing Voice Synthesis (SVS) has witnessed significant advancements with the advent of deep learning techniques. However, a major challenge in SVS is the scarcity of labeled singing voice data, which limits the effectiveness of supervised learning methods. In response to this challenge, this paper introduces a novel approach that enhances the quality of SVS by leveraging unlabeled data through pre-trained self-supervised learning models. Building upon the existing VISinger2 framework, this study integrates additional spectral feature information into the system to enhance its performance. The integration aims to harness the rich acoustic features from the pre-trained models, thereby enriching the synthesis and yielding a more natural and expressive singing voice. Experimental results on various corpora demonstrate the efficacy of this approach in improving the overall quality of synthesized singing voices in both objective and subjective metrics.
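As an illustration of the objective metrics mentioned here, the sketch below computes mel-cepstral distortion (MCD), which the TL;DR names, using its standard definition: $\mathrm{MCD} = \frac{10}{\ln 10}\sqrt{2\sum_{d=1}^{D}(c_d - \hat{c}_d)^2}$, averaged over frames. It assumes the reference and synthesized mel-cepstra are already time-aligned and that column 0 holds the energy coefficient, which is excluded; the function name is hypothetical and this is not the paper's evaluation script.

```python
import numpy as np


def mel_cepstral_distortion(ref, syn):
    """Mean MCD in dB between two aligned mel-cepstral sequences.

    ref, syn: (T, D + 1) arrays of mel-cepstral coefficients.
    Assumption: frames are already aligned (no DTW is performed here)
    and coefficient 0 (energy) is skipped.
    """
    ref = np.asarray(ref, dtype=np.float64)
    syn = np.asarray(syn, dtype=np.float64)
    if ref.shape != syn.shape:
        raise ValueError("ref and syn must have the same shape")

    diff = ref[:, 1:] - syn[:, 1:]  # drop the 0th coefficient
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())
```

Lower values indicate that the synthesized spectrum is closer to the reference.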

Paper Structure (12 sections, 2 figures, 3 tables)

Introduction
Method
Overview of VISinger2 Framework
Integration with pre-trained SSL Models
Training and Loss Function
Experiments
Experimental Datasets
Experimental Models
Experimental Setups
Evaluation
Results
Conclusion

Figures (2)

Figure 1: The proposed VISinger2+ architecture. Model details are discussed in the Method section.
Figure 2: t-SNE visualization of singer embeddings extracted from synthesized audio of 4 singers in the ACE-Opencpop dataset.
