YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases
Gongyu Chen, Xiaoyu Zhang, Zhenqiang Weng, Junjie Zheng, Da Shen, Chaofan Ding, Wei-Qiang Zhang, Zihao Chen
TL;DR
YingMusic-SVC targets real-world zero-shot singing voice conversion (SVC) by combining continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning. Within a diffusion-Transformer framework, it introduces three singing-specific inductive biases (an RVC timbre shifter, an F0-aware timbre adaptor, and an energy-balanced flow loss) together with robust augmentation strategies for handling accompaniment and harmony. Empirical results show state-of-the-art performance in timbre fidelity, intelligibility, and perceptual naturalness, and strong robustness under harmony-contaminated and real-world conditions, which supports the system's industrial viability. The work also provides an open, end-to-end SVC pipeline and a multi-objective RL framework for perceptual optimization in singing synthesis.
Abstract
Singing voice conversion (SVC) aims to render the target singer's timbre while preserving melody and lyrics. However, existing zero-shot SVC systems remain fragile in real songs due to harmony interference, F0 errors, and the lack of inductive biases for singing. We propose YingMusic-SVC, a robust zero-shot framework that unifies continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning. Our model introduces a singing-trained RVC timbre shifter for timbre-content disentanglement, an F0-aware timbre adaptor for dynamic vocal expression, and an energy-balanced rectified flow matching loss to enhance high-frequency fidelity. Experiments on a graded multi-track benchmark show that YingMusic-SVC achieves consistent improvements over strong open-source baselines in timbre similarity, intelligibility, and perceptual naturalness, especially under accompanied and harmony-contaminated conditions, demonstrating its effectiveness for real-world SVC deployment.
