Leveraging Diverse Semantic-based Audio Pretrained Models for Singing Voice Conversion

Xueyao Zhang, Zihao Fang, Yicheng Gu, Haopeng Chen, Lexiao Zou, Junan Zhang, Liumeng Xue, Zhizheng Wu

TL;DR

Different semantic-based pretrained models capture diverse knowledge that can be complementary for SVC. Building on this finding, the authors design a Singing Voice Conversion framework based on Diverse Semantic-based Feature Fusion (DSFF-SVC).

Abstract

Singing Voice Conversion (SVC) is a technique that enables any singer to perform any song. To achieve this, it is essential to obtain speaker-agnostic representations from the source audio, which poses a significant challenge. A common solution is to use a semantic-based audio pretrained model as a feature extractor. However, the degree to which the extracted features meet the requirements of SVC remains an open question. These requirements include the ability to accurately model melody and lyrics, the speaker independence of the underlying acoustic information, and robustness to in-the-wild acoustic environments. In this study, we investigate in detail the knowledge within classical semantic-based pretrained models. We discover that the knowledge of different models is diverse and can be complementary for SVC. Based on these findings, we design a Singing Voice Conversion framework based on Diverse Semantic-based Feature Fusion (DSFF-SVC). Experimental results demonstrate that DSFF-SVC generalizes to and improves various existing SVC models, particularly in challenging real-world conversion tasks. Our demo website is available at https://diversesemanticsvc.github.io/.

Paper Structure

This paper contains 16 sections, 2 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: The role of the semantic-based pretrained model in the classic singing voice conversion pipeline.
  • Figure 2: The proposed Singing Voice Conversion framework based on Diverse Semantic-based Feature Fusion (DSFF-SVC). It can incorporate most existing models (i.e., acoustic models and waveform decoders) as a base.
  • Figure 3: The complementary role of diverse semantic-based features in melody modeling. Further benefits of jointly using diverse semantic-based features (including spectrogram reconstruction, lyrics modeling, etc.) are shown on the demo page: https://diversesemanticsvc.github.io/content_features.html.
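
The idea of fusing features from several semantic-based extractors, as named in the abstract and in Figure 2, can be illustrated with a small sketch. This is not the authors' implementation: the nearest-neighbor frame alignment, the random projection matrices (standing in for learned layers), the summation as the fusion operator, and all function names and dimensions are assumptions chosen only for illustration.

```python
import random

def align_frames(features, target_len):
    """Resample a list of frames to target_len frames by nearest-neighbor indexing."""
    src_len = len(features)
    return [features[(t * src_len) // target_len] for t in range(target_len)]

def make_projection(in_dim, out_dim, rng):
    """Hypothetical stand-in for a learned linear layer (random weights)."""
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(out_dim)] for _ in range(in_dim)]

def project(frame, weights):
    """Multiply one feature frame (length in_dim) by an in_dim x out_dim matrix."""
    out_dim = len(weights[0])
    return [sum(frame[i] * weights[i][j] for i in range(len(frame)))
            for j in range(out_dim)]

def fuse_features(feature_list, projections, target_len):
    """Align every feature stream to target_len frames, project, and sum."""
    out_dim = len(projections[0][0])
    fused = [[0.0] * out_dim for _ in range(target_len)]
    for feats, weights in zip(feature_list, projections):
        for t, frame in enumerate(align_frames(feats, target_len)):
            projected = project(frame, weights)
            fused[t] = [a + b for a, b in zip(fused[t], projected)]
    return fused

rng = random.Random(0)
# Two hypothetical extractors with different frame counts and feature sizes.
feats_a = [[rng.random() for _ in range(3)] for _ in range(4)]  # 4 frames x 3 dims
feats_b = [[rng.random() for _ in range(5)] for _ in range(6)]  # 6 frames x 5 dims
projections = [make_projection(3, 2, rng), make_projection(5, 2, rng)]

fused = fuse_features([feats_a, feats_b], projections, target_len=12)
print(len(fused), len(fused[0]))  # 12 2
```

The sketch only shows why alignment is needed before fusion: extractors generally emit frames at different rates and widths, so each stream is brought to a shared length and dimension before the streams are combined.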