SYKI-SVC: Advancing Singing Voice Conversion with Post-Processing Innovations and an Open-Source Professional Testset

Yiquan Zhou, Wenyu Wang, Hongwu Ding, Jiacheng Xu, Jihua Zhu, Xin Gao, Shihao Li

TL;DR

This work tackles high-fidelity singing voice conversion by explicitly disentangling timbre from linguistic content using SSL features (ContentVec) and ASR BNFs (Whisper), while leveraging $F_0$ from RMVPE and a VITS-based converter to reproduce target timbre. A novel post-processor directly supplements high-frequency content from the source to boost audio quality without compromising timbre, and an open-source professional testset enables robust evaluation of expressive singing techniques. The system, named SYKI-SVC, demonstrates state-of-the-art naturalness and timbre similarity, supported by extensive subjective and objective evaluations, and is trained with a diverse corpus at 24 kHz with upsampling to 48 kHz via post-processing. Together, these contributions advance SVC toward practical, professional-grade use and provide standardized data for future research.

Abstract

Singing voice conversion aims to transform a source singing voice into that of a target singer while preserving the original lyrics, melody, and various vocal techniques. In this paper, we propose a high-fidelity singing voice conversion system. Our system builds upon the SVCC T02 framework and consists of three key components: a feature extractor, a voice converter, and a post-processor. The feature extractor derives F0 contours and uses the ContentVec and Whisper models to extract speaker-independent linguistic features from the input singing voice. The voice converter then integrates the target timbre, F0, and linguistic content to synthesize the target speaker's waveform. The post-processor augments high-frequency information directly from the source through simple and effective signal processing to enhance audio quality. Due to the lack of a standardized professional dataset for evaluating expressive singing conversion systems, we have created and made publicly available a specialized test set. Comparative evaluations demonstrate that our system achieves a remarkably high level of naturalness, and further analysis confirms the efficacy of our proposed system design.
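
As a rough illustration of the post-processing idea described above (taking high-frequency content directly from the source), the following minimal sketch upsamples a 24 kHz converted signal to 48 kHz and adds the source's band above a cutoff. It is not the authors' implementation: the function names, the FFT-based resampling and filtering, the brick-wall filter, and the 12 kHz cutoff (the Nyquist frequency of a 24 kHz signal) are all assumptions made for this example, and the two signals are assumed to be time-aligned.

```python
import numpy as np


def upsample_2x_fft(x):
    """Double the sample rate by zero-padding the spectrum (assumed method)."""
    n = len(x)
    spec = np.fft.rfft(x)
    # A real signal of length 2n has n + 1 rfft bins; the new upper bins stay zero.
    padded = np.zeros(n + 1, dtype=complex)
    padded[: len(spec)] = spec
    # irfft normalizes by the output length (2n), so rescale by 2 to keep amplitude.
    return np.fft.irfft(padded, 2 * n) * 2.0


def highpass_fft(x, sample_rate, cutoff_hz):
    """Zero every frequency bin below cutoff_hz (brick-wall filter, assumed)."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    spec[freqs < cutoff_hz] = 0.0
    return np.fft.irfft(spec, len(x))


def supplement_high_band(converted_24k, source_48k, cutoff_hz=12000.0):
    """Hypothetical helper: add the source's band above cutoff_hz to the
    upsampled converted signal. The cutoff value is an assumption."""
    upsampled = upsample_2x_fft(converted_24k)
    n = min(len(upsampled), len(source_48k))
    high_band = highpass_fft(source_48k[:n], 48000, cutoff_hz)
    return upsampled[:n] + high_band
```

Calling `supplement_high_band(converted, source)` with a 24 kHz converted array and a 48 kHz source array of matching duration returns a 48 kHz array. A practical system would more likely use a smooth crossover filter rather than a brick-wall cut, to avoid ringing at the band edge.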

Paper Structure

This paper contains 17 sections, 5 equations, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: (a) Overview of the training and inference procedures; (b) details of the model.