Singing Voice Conversion with Accompaniment Using Self-Supervised Representation-Based Melody Features

Wei Chen, Binzhu Sha, Jing Yang, Zhuo Wang, Fan Fan, Zhiyong Wu

TL;DR

This work addresses singing voice conversion (SVC) in the presence of background music, where robust melody modeling is essential but difficult. It proposes a melody extractor based on self-supervised learning (SSL) representations that combines HuBERT or WavLM with a weighted sum of layer outputs and FFT blocks, integrated into an encoder–decoder SVC framework with ASR-derived bottleneck features (BNFs), discriminators, and a HiFi-GAN vocoder. The authors demonstrate that SSL-based melody features improve melody accuracy under noisy accompaniment and outperform state-of-the-art baselines on objective metrics such as $\mathrm{F0RMSE}$ and $\mathrm{F0CORR}$ as well as in subjective MOS assessments. They also analyze how individual SSL layers contribute melody information, showing that fine-tuning lets higher layers model melody effectively. The result is robust any-to-one SVC with accompaniment, which suggests broader applicability in challenging real-world audio conditions.
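The weighted sum of SSL layer outputs mentioned above can be illustrated with a minimal sketch. It assumes one learnable scalar per layer, normalized with a softmax so the weights are positive and sum to one; that normalization is a common choice for SSL feature aggregation, not something the summary confirms for this paper. The names `softmax`, `weighted_layer_sum`, `toy_layers`, and `toy_logits` are hypothetical, and plain Python lists stand in for the tensors a real model would use.

```python
import math


def softmax(logits):
    """Turn unnormalized per-layer scores into weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_outputs, layer_logits):
    """Mix hidden states from several SSL layers into one feature sequence.

    layer_outputs: nested list shaped [num_layers][num_frames][dim]
    layer_logits:  one scalar per layer (learnable in a real model)
    Softmax normalization is an assumption of this sketch.
    """
    if len(layer_outputs) != len(layer_logits):
        raise ValueError("need exactly one weight per layer")
    weights = softmax(layer_logits)
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    mixed = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_outputs):
        for t in range(num_frames):
            for d in range(dim):
                mixed[t][d] += weight * layer[t][d]
    return mixed


# Toy input: 3 "layers", 2 frames, 2 feature dimensions (made-up values).
toy_layers = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
toy_logits = [0.0, 0.0, 0.0]  # equal scores give equal weights
mixed = weighted_layer_sum(toy_layers, toy_logits)
```

With equal scores every layer contributes one third, so the output is the plain average of the layers. During training the scores would be updated by gradient descent, and inspecting the resulting weights is one way to see which layers carry melody information.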

Abstract

Melody preservation is crucial in singing voice conversion (SVC). However, in many scenarios the audio is accompanied by background music (BGM), which can cause audio distortion and interfere with the extraction of melody and other key features, significantly degrading SVC performance. Previous methods have attempted to address this by using more robust neural network-based melody extractors, but their performance drops sharply in the presence of complex accompaniment. Other approaches involve performing source separation before conversion, but this often introduces noticeable artifacts, leading to a significant drop in conversion quality and increasing the user's operational costs. To address these issues, we introduce a novel SVC method that uses self-supervised representation-based melody features to improve melody modeling accuracy in the presence of BGM. In our experiments, we compare the effectiveness of different self-supervised learning (SSL) models for melody extraction and explore for the first time how SSL benefits the task of melody extraction. The experimental results demonstrate that our proposed SVC model significantly outperforms existing baseline methods in terms of melody accuracy and shows higher similarity and naturalness in both subjective and objective evaluations across noisy and clean audio environments.
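Objective melody accuracy of the kind referred to above is typically reported as an F0 root-mean-square error and an F0 correlation between the source and converted singing. The sketch below shows one plausible way to compute both. It assumes frame-aligned F0 contours in Hz, with 0 marking unvoiced frames, and scores only frames voiced in both contours; the paper's exact definitions (units, voicing handling, alignment) are not given here and may differ. The function names and sample contours are hypothetical.

```python
import math


def voiced_pairs(f0_ref, f0_conv):
    """Keep only frames where both contours are voiced (F0 > 0)."""
    if len(f0_ref) != len(f0_conv):
        raise ValueError("contours must be frame-aligned and equally long")
    return [(r, c) for r, c in zip(f0_ref, f0_conv) if r > 0 and c > 0]


def f0_rmse(f0_ref, f0_conv):
    """Root-mean-square F0 error in Hz over mutually voiced frames."""
    pairs = voiced_pairs(f0_ref, f0_conv)
    if not pairs:
        raise ValueError("no mutually voiced frames")
    return math.sqrt(sum((r - c) ** 2 for r, c in pairs) / len(pairs))


def f0_corr(f0_ref, f0_conv):
    """Pearson correlation of F0 over mutually voiced frames."""
    pairs = voiced_pairs(f0_ref, f0_conv)
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two mutually voiced frames")
    mean_r = sum(r for r, _ in pairs) / n
    mean_c = sum(c for _, c in pairs) / n
    cov = sum((r - mean_r) * (c - mean_c) for r, c in pairs)
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    var_c = sum((c - mean_c) ** 2 for _, c in pairs)
    if var_r == 0 or var_c == 0:
        raise ValueError("correlation is undefined for a constant contour")
    return cov / math.sqrt(var_r * var_c)


# Made-up contours: the converted F0 sits 2 Hz above the reference
# wherever both frames are voiced.
ref_f0 = [220.0, 0.0, 230.0, 240.0, 250.0]
conv_f0 = [222.0, 210.0, 232.0, 0.0, 252.0]
rmse = f0_rmse(ref_f0, conv_f0)
corr = f0_corr(ref_f0, conv_f0)
```

A lower RMSE and a correlation closer to 1 indicate that the converted singing follows the source melody more closely. Restricting the comparison to mutually voiced frames keeps voicing decisions from dominating the error, which is a design choice of this sketch rather than a documented detail of the paper.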

Paper Structure

This paper contains 12 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: SVC framework. The snowflake marks parameters that remain unchanged (frozen) while the SVC framework is trained.
  • Figure 2: Melody extractor.
  • Figure 3: Visualization of the weights used to extract representations from the SSL models.