CONTUNER: Singing Voice Beautifying with Pitch and Expressiveness Condition

Jianzong Wang, Pengcheng Li, Xulong Zhang, Ning Cheng, Jing Xiao

TL;DR

This work tackles singing voice beautification (SVB) by proposing ConTuner, a diffusion-based system that improves pitch and expressiveness without altering timbre or content and without requiring paired amateur-professional data. It introduces a generator-based diffusion backbone with a modified conditioning signal derived from a pitch predictor and an expressiveness enhancer, enabling control over the Mel-spectrogram generation through a condition $Con$ and leveraging the posterior $q(x_{t-1}|x_t,x_{0}')$ during inference. The pitch predictor maps MIDI and spectral envelope to a beautified pitch curve, while the expressiveness enhancer disentangles and refines expressiveness to align amateur performance with professional targets. Evaluations on the PASV dataset across Mandarin and English show improvements in pitch alignment (PAA), audio quality (MOS-Q), and expressiveness (MOS-E) over baselines, with ablations confirming the benefits of the expressiveness enhancer and the generator-based diffusion approach. This approach reduces reliance on paired data and offers a fast, high-fidelity SVB solution with practical implications for multilingual singing applications.
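The posterior $q(x_{t-1}|x_t,x_{0}')$ mentioned above can be illustrated with a minimal NumPy sketch. It assumes the standard Gaussian DDPM posterior and a short linear noise schedule; the summary states neither the paper's schedule nor its number of steps, so both are placeholders. `toy_generator` is a hypothetical stand-in for the conditioned generator that predicts $x_{0}'$ from $x_t$ and $Con$, not the authors' network.

```python
import numpy as np

def make_schedule(num_steps=4, beta_start=0.1, beta_end=0.5):
    # Assumed linear beta schedule; the values are illustrative only.
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def posterior_params(x_t, x0_pred, t, betas, alphas, alpha_bars):
    # Mean and variance of the Gaussian q(x_{t-1} | x_t, x_0') in standard DDPM form.
    # t is 0-indexed, so t == 0 is the last denoising step.
    alpha_bar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    coef_x0 = np.sqrt(alpha_bar_prev) * betas[t] / (1.0 - alpha_bars[t])
    coef_xt = np.sqrt(alphas[t]) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bars[t])
    mean = coef_x0 * x0_pred + coef_xt * x_t
    var = betas[t] * (1.0 - alpha_bar_prev) / (1.0 - alpha_bars[t])
    return mean, var

def toy_generator(x_t, t, condition):
    # Hypothetical placeholder: a real generator would be a trained network
    # that predicts the clean Mel-spectrogram from x_t, t and the condition.
    return condition

def sample(condition, schedule, rng):
    # Start from Gaussian noise, then alternate between predicting x_0'
    # and sampling x_{t-1} from the posterior.
    x = rng.standard_normal(condition.shape)
    for t in reversed(range(len(schedule[0]))):
        x0_pred = toy_generator(x, t, condition)
        mean, var = posterior_params(x, x0_pred, t, *schedule)
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(var) * noise
    return x
```

Because each step samples from the posterior around a fresh prediction of $x_{0}'$ instead of removing a small amount of noise, a handful of steps can suffice, which is the usual motivation for generator-based acceleration.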

Abstract

Singing voice beautifying is a novel task with practical value in daily life: it aims to correct the pitch of a singing voice and improve its expressiveness without changing the original timbre or content. Existing methods either rely on paired data or concentrate only on pitch correction. However, professional and amateur songs from the same person are hard to obtain, and singing voice beautifying involves not only pitch correction but also other aspects such as emotion and rhythm. We therefore propose ConTuner, a fast and high-fidelity singing voice beautifying system: a diffusion model combined with a modified condition to generate the beautified Mel-spectrogram, where the modified condition is composed of optimized pitch and expressiveness. For pitch correction, we establish a mapping from MIDI and spectrum envelope to pitch. To make amateur singing more expressive, we propose an expressiveness enhancer that operates in the latent space to convert amateur vocal tone to professional. ConTuner achieves a satisfactory beautification effect on both Mandarin and English songs. An ablation study demonstrates that the expressiveness enhancer and the generator-based acceleration method in ConTuner are effective.

Paper Structure

This paper contains 24 sections, 6 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Architecture of ConTuner. The pitch predictor maps MIDI and the spectral envelope to pitch, while the expressiveness enhancer disentangles the expressiveness representation from the singing voice. Their outputs are combined into the condition used in the denoising process.
  • Figure 2: Details of the pitch predictor and expressiveness enhancer.
  • Figure 3: Structure of the spectrogram denoiser.
  • Figure 4: The pitch alignment accuracy of different algorithms on Mandarin and English songs.