CoMoSVC: Consistency Model-based Singing Voice Conversion
Yiwen Lu, Zhen Ye, Wei Xue, Xu Tan, Qifeng Liu, Yike Guo
TL;DR
CoMoSVC addresses the slow sampling of diffusion-based singing voice conversion by leveraging a consistency-model framework. It trains a diffusion-based teacher and distills a one-step student via consistency constraints, enabling fast, high-fidelity SVC conditioned on content and target singer identity. Empirical results demonstrate that CoMoSVC delivers comparable or superior naturalness and speaker similarity while achieving large inference-speed gains (orders of magnitude faster than diffusion baselines). This approach offers a practical path toward real-time, high-quality SVC in applications requiring fast turnaround and strong timbre fidelity.
Abstract
Diffusion-based Singing Voice Conversion (SVC) methods have achieved remarkable performance, producing natural audio with high similarity to the target timbre. However, the iterative sampling process results in slow inference, so acceleration becomes crucial. In this paper, we propose CoMoSVC, a consistency model-based SVC method that aims to achieve both high-quality generation and high-speed sampling. A diffusion-based teacher model is first designed specifically for SVC, and a student model is then distilled from it under self-consistency properties to achieve one-step sampling. Experiments on a single NVIDIA RTX 4090 GPU show that CoMoSVC, while significantly faster at inference than the state-of-the-art (SOTA) diffusion-based SVC system, still achieves comparable or superior conversion performance on both subjective and objective metrics. Audio samples and code are available at https://comosvc.github.io/.
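The contrast between iterative sampling and one-step sampling under a self-consistency property can be illustrated with a toy sketch. This is not the CoMoSVC implementation: the "data" is a single hypothetical clean value, the teacher's denoiser and the student's consistency function are closed-form stand-ins for neural networks, and the noise range `EPS`, `T` and all function names are assumptions chosen for illustration.

```python
import math

EPS, T = 0.002, 80.0   # assumed smallest and largest noise levels
CLEAN = 0.3            # hypothetical clean value (e.g. one mel-spectrogram bin)

def ideal_denoiser(x, t):
    """Stand-in for the teacher network: for point-mass data the best
    denoised estimate is always CLEAN, whatever the noisy input is."""
    return CLEAN

def iterative_sample(x_T, n_steps):
    """Teacher-style sampling: Euler steps on dx/dt = (x - D(x, t)) / t,
    going from noise level T down to EPS. One denoiser call per step."""
    ts = [T + (EPS - T) * i / n_steps for i in range(n_steps + 1)]
    x, calls = x_T, 0
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        slope = (x - ideal_denoiser(x, t_cur)) / t_cur
        calls += 1
        x = x + (t_next - t_cur) * slope
    return x, calls

def one_step_sample(x_t, t):
    """Student-style sampling: a consistency function maps any point
    (x_t, t) on a trajectory straight to that trajectory's value at EPS.
    For this toy the mapping is known in closed form; in practice it
    would be a distilled network."""
    return CLEAN + (x_t - CLEAN) * EPS / t, 1

# Start from a noisy point at the largest noise level.
x_T = CLEAN + T * 1.0
x_iter, iter_calls = iterative_sample(x_T, n_steps=50)
x_one, one_calls = one_step_sample(x_T, T)

# Self-consistency: an intermediate point on the same trajectory
# is mapped to the same output as the starting point.
t_mid = 10.0
x_mid = CLEAN + (x_T - CLEAN) * t_mid / T
x_from_mid, _ = one_step_sample(x_mid, t_mid)
```

In this toy setting `x_iter`, `x_one`, and `x_from_mid` agree up to floating-point error, while the iterative sampler spends 50 denoiser calls and the one-step sampler spends one. With real networks the agreement is only approximate, which is why conversion quality has to be evaluated empirically.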
