
SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion

Bingsong Bai, Fengping Wang, Yingming Gao, Ya Li

TL;DR

This work tackles hoarseness in cross-domain singing voice conversion (SVC), which arises when the source and target voices differ widely in pitch. It introduces SPA-SVC, which combines self-supervised pitch augmentation via cycle pitch shifting with an SSIM-based cycle-consistency loss, integrated into a diffusion-enhanced SVC pipeline built on DDSP and a pretrained NSF-HiFiGAN vocoder. Key contributions include random cycle pitch shifts during training (6–18 semitones), the objective $L_{total}=L_{cyc}+L_{diff}$ with $L_{cyc}$ based on SSIM, and improvements that require no extra data or parameters. Evaluations on M4Singer show that SPA-SVC improves MOS scores and spectrogram quality in both general and cross-domain SVC tasks, with reduced hoarseness across a broader vocal range.
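
The pitch arithmetic behind a cycle shift can be sketched as follows. This is a minimal illustration, not the authors' implementation: it covers only the F0 bookkeeping (shift by k semitones, then shift back by k) and does not run the conversion model. The names `shift_f0` and `sample_cycle_shift`, the integer shift values, the random sign, and the convention that unvoiced frames are stored as 0 Hz are all assumptions made for this example; only the 6–18 semitone range comes from the summary above.

```python
import random


def shift_f0(f0_hz, semitones):
    """Shift an F0 contour (in Hz) by a number of semitones.

    Unvoiced frames, assumed here to be stored as 0.0, stay 0.0
    because they are simply multiplied by the ratio.
    """
    ratio = 2.0 ** (semitones / 12.0)  # one semitone = a factor of 2^(1/12)
    return [f * ratio for f in f0_hz]


def sample_cycle_shift(rng, low=6, high=18):
    """Draw a shift of `low`..`high` semitones with a random sign (assumed)."""
    k = rng.randint(low, high)
    return k if rng.random() < 0.5 else -k


rng = random.Random(0)
f0 = [220.0, 0.0, 246.94, 261.63]   # toy contour; 0.0 marks an unvoiced frame
k = sample_cycle_shift(rng)
shifted = shift_f0(f0, k)           # forward shift away from the source range
restored = shift_f0(shifted, -k)    # backward shift closes the cycle
```

Because the backward shift undoes the forward one, the original recording can serve as the reference for the restored output. That is what makes the scheme self-supervised and lets it work without additional data.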

Abstract

Diffusion-based singing voice conversion (SVC) models have shown better synthesis quality than traditional methods. However, in cross-domain SVC scenarios, where there is a significant disparity in pitch between the source and target voice domains, these models tend to generate hoarse audio, which makes high-quality vocal output difficult to achieve. In this paper, we therefore propose a Self-supervised Pitch Augmentation method for Singing Voice Conversion (SPA-SVC), which enhances voice quality in SVC tasks without requiring additional data or increasing model parameters. We introduce a cycle pitch shifting training strategy and a Structural Similarity Index (SSIM) loss into our SVC model. Experimental results on the public singing dataset M4Singer indicate that the proposed method significantly improves model performance in general SVC scenarios and, in particular, in cross-domain SVC scenarios.
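
To make the SSIM loss concrete, the sketch below computes a structural similarity score between two mel-spectrograms and turns it into a loss. It rests on several assumptions not confirmed by the text: SSIM is computed over a single global window rather than the sliding local windows of standard SSIM, the loss is taken as 1 − SSIM, spectrogram values are assumed to be normalized to [0, 1], and the function names are hypothetical. Only the sum $L_{total}=L_{cyc}+L_{diff}$ is taken from the summary.

```python
def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM over one global window (a simplification of windowed SSIM)."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("inputs must be non-empty and of equal length")
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


def cycle_loss(mel_ref, mel_cyc):
    """Assumed form of L_cyc: 1 - SSIM between reference and cycled mels."""
    flat_ref = [v for frame in mel_ref for v in frame]
    flat_cyc = [v for frame in mel_cyc for v in frame]
    return 1.0 - global_ssim(flat_ref, flat_cyc)


def total_loss(l_cyc, l_diff):
    """L_total = L_cyc + L_diff, as stated in the summary."""
    return l_cyc + l_diff


mel = [[0.1, 0.5], [0.9, 0.3]]     # toy 2x2 "spectrogram"
other = [[0.9, 0.3], [0.1, 0.5]]   # same values, different structure
loss_same = cycle_loss(mel, mel)
loss_diff = cycle_loss(mel, other)
```

Identical spectrograms give a loss of about zero, while spectrograms with the same values in a different arrangement are penalized. This sensitivity to structure is what distinguishes an SSIM-based term from a plain element-wise distance.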

Paper Structure (17 sections, 5 equations, 3 figures, 1 table)

This paper contains 17 sections, 5 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: The pipeline of cycle pitch shifting.
  • Figure 2: The training and inference architecture of SPA-SVC.
  • Figure 3: Comparison of spectrograms.