YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases

Gongyu Chen, Xiaoyu Zhang, Zhenqiang Weng, Junjie Zheng, Da Shen, Chaofan Ding, Wei-Qiang Zhang, Zihao Chen

TL;DR

YingMusic-SVC targets real-world zero-shot singing voice conversion by combining continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning. It introduces three singing-specific inductive biases (an RVC timbre shifter, an F0-aware timbre adaptor, and an energy-balanced flow loss) within a diffusion-Transformer framework, along with augmentation strategies for handling accompaniment and harmony. On a graded multi-track benchmark, it improves over strong open-source baselines in timbre similarity, intelligibility, and perceptual naturalness, and it remains robust under harmony-contaminated and real-world conditions. The work also provides an open, end-to-end SVC pipeline and a multi-objective RL framework for perceptual optimization in singing synthesis.

Abstract

Singing voice conversion (SVC) aims to render the target singer's timbre while preserving melody and lyrics. However, existing zero-shot SVC systems remain fragile in real songs due to harmony interference, F0 errors, and the lack of inductive biases for singing. We propose YingMusic-SVC, a robust zero-shot framework that unifies continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning. Our model introduces a singing-trained RVC timbre shifter for timbre-content disentanglement, an F0-aware timbre adaptor for dynamic vocal expression, and an energy-balanced rectified flow matching loss to enhance high-frequency fidelity. Experiments on a graded multi-track benchmark show that YingMusic-SVC achieves consistent improvements over strong open-source baselines in timbre similarity, intelligibility, and perceptual naturalness, especially under accompanied and harmony-contaminated conditions, demonstrating its effectiveness for real-world SVC deployment.

Paper Structure

This paper contains 28 sections, 23 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: From professionally produced songs to zero-shot SVC: baseline limitations and our improved pipeline with singing-oriented designs.
  • Figure 2: Three-Stage Training Framework of the Proposed YingMusic-SVC Model.
  • Figure 3: Key components of YingMusic-SVC. (a) F0-aware timbre adaptor that refines global timbre embeddings into fine-grained, pitch-sensitive representations. (b) Energy-balanced flow matching loss with time- and frequency-dependent weighting to enhance high-frequency reconstruction fidelity.
  • Figure 4: RL ablation experiments on noise level $a$.
  • Figure 5: RL ablation experiments on window size.