ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps

Yulin Song, Guorui Sang, Jing Yu, Chuangbai Xiao

TL;DR

This work addresses efficient, high-fidelity singing voice synthesis by introducing ConSinger, a consistency-model–based SVS system that operates with minimal sampling steps and without a teacher model. The architecture combines a music-score encoder, a supplementary mel-spectrogram decoder, a CM-Denoiser, a Scorer that selects the optimal restoration point, and a HiFi-GAN vocoder, all trained in a single network. Experiments on the PopCS dataset show that ConSinger, particularly its v3 variant, achieves competitive speed with markedly improved subjective and objective quality over baselines such as DiffSinger, aided by scorer-guided point selection and prior knowledge from a lightweight acoustic model. The approach incurs an additional training burden due to the supplementary decoder, and future work aims to better align the scorer's objective with perceptual quality to push further gains in both speed and fidelity.

Abstract

A singing voice synthesis (SVS) system is expected to generate high-fidelity singing voice from given music scores (lyrics, duration, and pitch). Recently, diffusion models have performed well in this field. However, they trade inference speed for high-quality sample generation, which limits their application scenarios. To obtain high-quality synthetic singing voice more efficiently, we propose ConSinger, a singing voice synthesis method based on the consistency model that achieves high-fidelity synthesis with minimal steps. The model is trained by applying a consistency constraint, and generation quality is greatly improved at the expense of a small amount of inference speed. Our experiments show that ConSinger is highly competitive with the baseline model in terms of generation speed and quality. Audio samples are available at https://keylxiao.github.io/consinger.

Paper Structure

This paper contains 13 sections, 9 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: The training and inference pipelines of ConSinger. The supplementary decoder and the scorer are shown in the two light-colored boxes. $m$ is the music score; $t$ is the step number; $C_t$ and $C_m$ are their respective embeddings; $\tilde{x}$ is the mel-spectrogram generated by the supplementary decoder; $x_t$ is the ground-truth mel-spectrogram $x$ with $t$-level Gaussian noise; $op$ is the optimal point calculated by the scorer; $k$ is the noise-addition level calculated in place of $T$.
  • Figure 2: Subfigure (a) shows the quality of mel-spectrograms restored from different noise levels. The points on the x-axis with values 0 and 1 represent GT and GT (mel+HiFi-GAN), respectively. Subfigure (b) shows the curves of three important model parameters (Eq. (\ref{cskipout}) and Eq. (\ref{sample t}), $\rho = 7$) as functions of the normalized time step $t$ in the consistency model.
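
The equations referenced in Figure 2(b) are not reproduced on this page. As a rough illustration of what such curves look like, the sketch below uses the standard consistency-model parameterization: the $\rho$-spaced time-step schedule and the $c_{\text{skip}}$ / $c_{\text{out}}$ boundary-condition coefficients. This is an assumption, since ConSinger's own definitions may differ. The function names are hypothetical, and the defaults `eps=0.002`, `t_max=80.0`, and `sigma_data=0.5` are the values commonly used in the consistency-model literature, not values confirmed by this paper. Only $\rho = 7$ comes from the caption.

```python
import math


def rho_timesteps(n, eps=0.002, t_max=80.0, rho=7.0):
    """Return n time steps from eps to t_max using rho-spacing (requires n >= 2).

    Assumed schedule: t_i = (eps^(1/rho) + i/(n-1) * (t_max^(1/rho) - eps^(1/rho)))^rho
    """
    lo, hi = eps ** (1.0 / rho), t_max ** (1.0 / rho)
    return [(lo + i / (n - 1) * (hi - lo)) ** rho for i in range(n)]


def c_skip(t, eps=0.002, sigma_data=0.5):
    """Weight on the noisy input; equals 1 at t = eps (boundary condition)."""
    return sigma_data ** 2 / ((t - eps) ** 2 + sigma_data ** 2)


def c_out(t, eps=0.002, sigma_data=0.5):
    """Weight on the network output; equals 0 at t = eps (boundary condition)."""
    return sigma_data * (t - eps) / math.sqrt(sigma_data ** 2 + t ** 2)


if __name__ == "__main__":
    # Print the three quantities against the normalized step index i / (n - 1).
    steps = rho_timesteps(6)
    for i, t in enumerate(steps):
        print(f"{i / (len(steps) - 1):.1f}  t={t:9.4f}  "
              f"c_skip={c_skip(t):.4f}  c_out={c_out(t):.4f}")
```

With $\rho = 7$ the schedule concentrates most steps near small noise levels, and under these assumed formulas $c_{\text{skip}}$ falls from 1 toward 0 as $t$ grows while $c_{\text{out}}$ rises from 0.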