VibE-SVC: Vibrato Extraction with High-frequency F0 Contour for Singing Voice Conversion

Joon-Seung Choi, Dong-Min Byun, Hyung-Seok Oh, Seong-Whan Lee

TL;DR

This work proposes VibE-SVC, a controllable singing voice conversion model that explicitly extracts and manipulates vibrato using the discrete wavelet transform, enabling precise vibrato transfer and flexible vibrato control.

Abstract

Controlling singing style is crucial for achieving an expressive and natural singing voice. Among the various style factors, vibrato plays a key role in conveying emotions and enhancing musical depth. However, modeling vibrato remains challenging due to its dynamic nature, making it difficult to control in singing voice conversion. To address this, we propose VibE-SVC, a controllable singing voice conversion model that explicitly extracts and manipulates vibrato using the discrete wavelet transform. Unlike previous methods that model vibrato implicitly, our approach decomposes the F0 contour into frequency components, enabling precise vibrato transfer and flexible vibrato control. Experimental results show that VibE-SVC effectively transforms singing styles while preserving speaker similarity. Both subjective and objective evaluations confirm high-quality conversion.
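
As a rough illustration of the decomposition described in the abstract, the sketch below splits a synthetic F0 contour into a smooth part (rebuilt from the approximation coefficients only) and a fast-varying part (rebuilt from the detail coefficients only) using a multi-level Haar DWT. The Haar wavelet, the four decomposition levels, the 100 frames-per-second rate, the synthetic contour, and all function names are assumptions made for this example; the abstract does not state which wavelet or depth VibE-SVC uses.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_dwt(x):
    """One level of the orthonormal Haar DWT (len(x) must be even)."""
    approx = (x[0::2] + x[1::2]) / SQRT2
    detail = (x[0::2] - x[1::2]) / SQRT2
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / SQRT2
    x[1::2] = (approx - detail) / SQRT2
    return x

def split_f0(f0, levels=4):
    """Split f0 into (smooth, fast) parts whose sum equals f0.

    `levels` is a hypothetical setting, not a value from the paper.
    """
    f0 = np.asarray(f0, dtype=float)
    if len(f0) % (2 ** levels) != 0:
        raise ValueError("len(f0) must be divisible by 2**levels")

    # Analysis: peel off one band of detail coefficients per level.
    approx, details = f0, []
    for _ in range(levels):
        approx, d = haar_dwt(approx)
        details.append(d)

    # Synthesis from the approximation only (details zeroed).
    smooth = approx
    for d in reversed(details):
        smooth = haar_idwt(smooth, np.zeros_like(d))

    # Synthesis from the details only (approximation zeroed).
    fast = np.zeros_like(approx)
    for d in reversed(details):
        fast = haar_idwt(fast, d)

    return smooth, fast

# Synthetic contour: slow upward glide plus a 6 Hz, 5 Hz-deep oscillation.
t = np.arange(256) / 100.0  # assumed 100 frames per second
f0 = 220.0 + 10.0 * t + 5.0 * np.sin(2.0 * np.pi * 6.0 * t)
smooth, fast = split_f0(f0, levels=4)
```

Because the transform is linear, `smooth + fast` reproduces the input contour exactly (up to floating-point error), so the fast part can be edited or swapped and then added back to the smooth part.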

Paper Structure

This paper contains 19 sections, 5 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The overview of the proposed VibE-SVC model.
  • Figure 2: F0 contour disentanglement: (a) Source F0 contour, (b) Reconstructed F0 contour from approximation coefficient, and (c) Reconstructed F0 contour from detail coefficients.
  • Figure 3: Correlation between MOS and style accuracy.
  • Figure 4: Comparison of F0 contours: vibrato-to-straight conversion (top) and straight-to-vibrato conversion (bottom).
  • Figure 5: (a), (b), and (c) show global-level vibrato scaling; (d) shows frame-level vibrato scaling.
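
The distinction between global-level and frame-level scaling in the Figure 5 caption can be sketched as follows, assuming that scaling multiplies an already extracted vibrato component before it is added back to the smooth pitch contour. That assumption, the function name `scale_vibrato`, and the synthetic contours are all hypothetical; the caption does not describe how VibE-SVC applies the scale.

```python
import numpy as np

def scale_vibrato(smooth, vibrato, scale):
    """Recombine a smooth F0 contour with a scaled vibrato component.

    `scale` may be a scalar (global-level) or one value per frame
    (frame-level). Hypothetical helper, not the paper's implementation.
    """
    smooth = np.asarray(smooth, dtype=float)
    vibrato = np.asarray(vibrato, dtype=float)
    scale = np.broadcast_to(np.asarray(scale, dtype=float), vibrato.shape)
    return smooth + scale * vibrato

# Synthetic stand-ins for the two components (assumed 100 frames/s).
t = np.arange(200) / 100.0
smooth = np.full_like(t, 220.0)               # flat 220 Hz note
vibrato = 5.0 * np.sin(2.0 * np.pi * 6.0 * t)  # 6 Hz oscillation

straight = scale_vibrato(smooth, vibrato, 0.0)  # global: vibrato removed
doubled = scale_vibrato(smooth, vibrato, 2.0)   # global: depth doubled
ramp = np.linspace(0.0, 1.0, len(t))            # frame-level: fade in
faded_in = scale_vibrato(smooth, vibrato, ramp)
```

A scalar changes vibrato depth uniformly over the whole contour, whereas a per-frame array lets the depth vary over time, for example fading vibrato in over a sustained note.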