ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps
Yulin Song, Guorui Sang, Jing Yu, Chuangbai Xiao
TL;DR
This work addresses efficient, high-fidelity singing voice synthesis (SVS) by introducing ConSinger, a consistency-model–based SVS system that samples in a minimal number of steps and needs no teacher model. The architecture combines a music-score encoder, a supplementary mel-spectrogram decoder, a CM-Denoiser, a Scorer that selects the optimal restoration point, and a HiFi-GAN vocoder, all trained in a single network. Experiments on the PopCS dataset show that ConSinger, particularly its v3 variant, achieves competitive speed with markedly better subjective and objective quality than baselines such as DiffSinger, aided by scorer-guided point selection and prior knowledge from a lightweight acoustic model. The approach does add training cost because of the supplementary decoder, and future work aims to align the Scorer's objective more closely with perceptual quality to gain further speed and fidelity.
Abstract
A singing voice synthesis (SVS) system is expected to generate high-fidelity singing voices from given music scores (lyrics, duration, and pitch). Recently, diffusion models have performed well in this field. However, they trade inference speed for sample quality, which limits their application scenarios. To obtain high-quality synthetic singing voices more efficiently, we propose ConSinger, a singing voice synthesis method based on the consistency model that achieves high-fidelity synthesis with minimal steps. The model is trained with a consistency constraint, and generation quality improves greatly at the cost of a small reduction in inference speed. Our experiments show that ConSinger is highly competitive with the baseline model in terms of generation speed and quality. Audio samples are available at https://keylxiao.github.io/consinger.
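
To make the phrase "consistency constraint" concrete, the following is a minimal sketch of teacher-free consistency training as it is commonly formulated for consistency models in general. It is not ConSinger's actual implementation: the constants `SIGMA_DATA` and `EPS`, the function names, the squared-error distance, and the use of a separate `target_network` are all illustrative assumptions, and the summary above does not specify how the CM-Denoiser is conditioned on the music score.

```python
import numpy as np

# Assumed constants (hypothetical values, not taken from the paper).
SIGMA_DATA = 0.5   # assumed standard deviation of the clean data
EPS = 0.002        # assumed smallest noise level on the trajectory


def c_skip(t):
    # Weight of the skip connection; equals 1 at t == EPS.
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)


def c_out(t):
    # Weight of the network output; equals 0 at t == EPS.
    return SIGMA_DATA * (t - EPS) / np.sqrt(SIGMA_DATA**2 + t**2)


def consistency_fn(network, x_t, t):
    # Parameterization that enforces the boundary condition f(x, EPS) = x.
    return c_skip(t) * x_t + c_out(t) * network(x_t, t)


def consistency_training_loss(network, target_network, x0, t_lo, t_hi, rng):
    # Perturb the same clean sample with the same noise at two adjacent
    # noise levels, so both points lie on one (approximate) trajectory.
    z = rng.standard_normal(x0.shape)
    x_hi = x0 + t_hi * z
    x_lo = x0 + t_lo * z
    # The consistency constraint asks both points to map to the same output.
    pred = consistency_fn(network, x_hi, t_hi)
    # In practice the target would be treated as a constant (no gradient).
    target = consistency_fn(target_network, x_lo, t_lo)
    return float(np.mean((pred - target) ** 2))
```

In this sketch `network` is any callable taking `(x_t, t)` and returning an array shaped like `x_t`; for SVS, `x0` would play the role of a mel-spectrogram. Because the trained function maps any point on a trajectory directly to its clean endpoint, sampling can take one or a few steps rather than a long denoising chain, which is the source of the speed advantage the abstract describes.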
