ConTuner: Singing Voice Beautifying with Pitch and Expressiveness Condition
Jianzong Wang, Pengcheng Li, Xulong Zhang, Ning Cheng, Jing Xiao
TL;DR
This work tackles singing voice beautification (SVB) by proposing ConTuner, a diffusion-based system that improves pitch and expressiveness without altering timbre or content and without requiring paired amateur-professional data. It introduces a generator-based diffusion backbone with a modified conditioning signal derived from a pitch predictor and an expressiveness enhancer, enabling control over the Mel-spectrogram generation through a condition $Con$ and leveraging the posterior $q(x_{t-1}|x_t,x_{0}')$ during inference. The pitch predictor maps MIDI and spectral envelope to a beautified pitch curve, while the expressiveness enhancer disentangles and refines expressiveness to align amateur performance with professional targets. Evaluations on the PASV dataset across Mandarin and English show improvements in pitch alignment (PAA), audio quality (MOS-Q), and expressiveness (MOS-E) over baselines, with ablations confirming the benefits of the expressiveness enhancer and the generator-based diffusion approach. This approach reduces reliance on paired data and offers a fast, high-fidelity SVB solution with practical implications for multilingual singing applications.
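The posterior $q(x_{t-1}|x_t,x_{0}')$ mentioned above has a closed form in standard Gaussian diffusion models. The following is a minimal sketch of that standard form, not the authors' implementation. It assumes a linear noise schedule (the summary does not state one), treats $x_{0}'$ as a clean-sample estimate supplied by some generator, and uses flat Python lists in place of Mel-spectrogram frames. All function names are hypothetical.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule; an assumption, the source does not specify one."""
    if num_steps < 2:
        return [beta_start] * num_steps
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta_s) up to and including each step."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def posterior_params(betas, t):
    """Return (coef_x0, coef_xt, variance) of q(x_{t-1} | x_t, x_0).

    t is a 0-indexed step; at t == 0 the previous cumulative product is 1,
    so the posterior collapses onto x_0 with zero variance.
    """
    abar = alpha_bars(betas)
    abar_t = abar[t]
    abar_prev = abar[t - 1] if t > 0 else 1.0
    beta_t = betas[t]
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(1.0 - beta_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    variance = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    return coef_x0, coef_xt, variance


def posterior_sample(x_t, x0_pred, betas, t, rng=random):
    """Draw x_{t-1} given the noisy x_t and an estimate x0_pred of the clean sample."""
    coef_x0, coef_xt, variance = posterior_params(betas, t)
    std = math.sqrt(variance)
    return [
        coef_x0 * x0 + coef_xt * xt + std * rng.gauss(0.0, 1.0)
        for x0, xt in zip(x0_pred, x_t)
    ]
```

In a generator-based sampler of this kind, each reverse step would first obtain `x0_pred` from the generator given `x_t` and the condition, then call `posterior_sample` to move to the previous step. How ConTuner feeds $Con$ into its generator is not described in this summary, so that part is left out.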
Abstract
Singing voice beautifying is a novel task with practical value in daily life. It aims to correct the pitch of a singing voice and improve its expressiveness without changing the original timbre or content. Existing methods either rely on paired data or concentrate only on pitch correction. However, professional and amateur recordings from the same singer are hard to obtain, and singing voice beautifying involves not only pitch correction but also other aspects such as emotion and rhythm. We therefore propose ConTuner, a fast and high-fidelity singing voice beautifying system that combines a diffusion model with a modified condition to generate the beautified Mel-spectrogram, where the modified condition is composed of optimized pitch and expressiveness. For pitch correction, we establish a mapping from MIDI and spectrum envelope to pitch. To make amateur singing more expressive, we propose an expressiveness enhancer that operates in the latent space to convert amateur vocal tone to professional vocal tone. ConTuner achieves a satisfactory beautification effect on both Mandarin and English songs. An ablation study demonstrates that the expressiveness enhancer and the generator-based acceleration method in ConTuner are effective.
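The mapping from MIDI and spectrum envelope to pitch described above is learned and is not reproduced here. As a small illustration of the MIDI side only, the sketch below converts a MIDI note number to its equal-tempered frequency and measures how far a sung frequency deviates from it in cents. It assumes standard A4 = 440 Hz tuning, and the function names are hypothetical.

```python
import math


def midi_to_hz(note, a4_hz=440.0):
    """Equal-tempered frequency of a MIDI note number (69 is A4)."""
    return a4_hz * 2.0 ** ((note - 69) / 12.0)


def cents_off(sung_hz, note, a4_hz=440.0):
    """Signed deviation of a sung frequency from the MIDI target, in cents."""
    return 1200.0 * math.log2(sung_hz / midi_to_hz(note, a4_hz))


# A singer slightly flat on A4: the deviation is negative.
print(cents_off(435.0, 69))
```

A deviation of 100 cents corresponds to one semitone, so a value like this gives a rough sense of how much correction a note would need.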
