Leveraging Diverse Semantic-based Audio Pretrained Models for Singing Voice Conversion

Xueyao Zhang; Zihao Fang; Yicheng Gu; Haopeng Chen; Lexiao Zou; Junan Zhang; Liumeng Xue; Zhizheng Wu

TL;DR

The knowledge captured by different semantic-based pretrained models is diverse and can be complementary for SVC. Building on this finding, the authors design a Singing Voice Conversion framework based on Diverse Semantic-based Feature Fusion (DSFF-SVC).

Abstract

Singing Voice Conversion (SVC) is a technique that enables any singer to perform any song. To achieve this, it is essential to obtain speaker-agnostic representations from the source audio, which poses a significant challenge. A common solution is to use a semantic-based audio pretrained model as a feature extractor. However, the degree to which the extracted features meet the requirements of SVC remains an open question. These requirements include the features' ability to accurately model melody and lyrics, the speaker independence of their underlying acoustic information, and their robustness to in-the-wild acoustic environments. In this study, we investigate in detail the knowledge within classical semantic-based pretrained models. We discover that the knowledge of different models is diverse and can be complementary for SVC. Based on this finding, we design a Singing Voice Conversion framework based on Diverse Semantic-based Feature Fusion (DSFF-SVC). Experimental results demonstrate that DSFF-SVC generalizes to various existing SVC models and improves them, particularly in challenging real-world conversion tasks. Our demo website is available at https://diversesemanticsvc.github.io/.
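
To make the idea of fusing features from several pretrained models concrete, the following is a minimal sketch, not the paper's actual method. It assumes that two extractors emit frame sequences at different time resolutions, that linear interpolation along time is an acceptable way to align them, and that fusion is plain per-frame concatenation. The function names `resample_frames` and `fuse_features` are hypothetical and chosen only for illustration.

```python
def resample_frames(frames, num_frames):
    """Linearly interpolate a sequence of feature frames along the time axis.

    `frames` is a list of equal-length lists of floats (one list per frame).
    This is an illustrative stand-in for a resolution transformation; the
    paper's own transformation may differ.
    """
    src_len = len(frames)
    if src_len == 0 or num_frames <= 0:
        raise ValueError("need at least one source frame and one target frame")
    if src_len == 1:
        # A single frame can only be repeated.
        return [list(frames[0]) for _ in range(num_frames)]
    out = []
    for i in range(num_frames):
        # Map target index i onto a fractional position in the source sequence.
        pos = i * (src_len - 1) / (num_frames - 1) if num_frames > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, src_len - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


def fuse_features(streams, num_frames):
    """Align every feature stream to `num_frames` frames, then concatenate per frame.

    Concatenation is an assumption made for this sketch, not a confirmed
    detail of DSFF-SVC.
    """
    aligned = [resample_frames(stream, num_frames) for stream in streams]
    fused = []
    for t in range(num_frames):
        frame = []
        for stream in aligned:
            frame.extend(stream[t])
        fused.append(frame)
    return fused


# Usage: a coarse 2-frame stream and a finer 3-frame stream, fused at 3 frames.
coarse = [[0.0], [1.0]]
fine = [[10.0, 20.0], [11.0, 21.0], [12.0, 22.0]]
fused = fuse_features([coarse, fine], num_frames=3)
```

Each fused frame here has one dimension from the coarse stream followed by two from the fine stream, so a downstream acoustic model would see both sources at a single shared frame rate.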

Paper Structure (16 sections, 2 equations, 3 figures, 6 tables)

Introduction
Related Work
Methodology
Analysis for Semantic-based Pretrained Models
Features Fusion for Models of Mismatched Resolutions
Singing Voice Conversion Framework based on Diverse Semantic-based Features Fusion
Experiments
Experimental Setup
Evaluation Tasks
Evaluation Metrics
Base Models
Implementation Details
Performance of Different Semantic-based Features (EQ1)
Performance of the DSFF-SVC framework (EQ2)
Performance of the Resolution Transformation based Features Fusion (EQ3)
...and 1 more section

Figures (3)

Figure 1: The role of semantic-based pretrained model in the classic singing voice conversion pipeline.
Figure 2: The proposed Singing Voice Conversion framework based on Diverse Semantic-based Features Fusion (DSFF-SVC). It is capable of incorporating most existing models (i.e., acoustic model and waveform decoder) as a base.
Figure 3: The complementary role of diverse semantic-based features in melody modeling. Further benefits of jointly using diverse semantic-based features (including spectrogram reconstruction, lyrics modeling, etc.) are shown on the demo page: https://diversesemanticsvc.github.io/content_features.html.
