HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios

Bingsong Bai; Yizhong Geng; Fengping Wang; Cong Wang; Puyuan Guo; Yingming Gao; Ya Li

TL;DR

HQ-SVC targets high-quality zero-shot singing voice conversion in low-resource settings. It uses a unified decoupled codec to extract content features $x_{con}$ and speaker features $x_{spk}$, augmented with an Enhanced Voice Adaptation (EVA) module for multi-feature fusion and a Speaker-F0 Predictor. Synthesis is progressively refined through Differentiable Digital Signal Processing (DDSP) and a diffusion model, with losses including $\mathcal{L}_{spk}$ and $\mathcal{L}_{f_0}$ to strengthen speaker discriminability and pitch accuracy. Inference employs DPM-Solver++ with 100 steps and chunking, achieving practical latency. The approach outperforms state-of-the-art baselines on zero-shot SVC benchmarks and also demonstrates strong zero-shot voice super-resolution, while enabling training on modest hardware and small datasets. This work advances practical, high-fidelity voice conversion and super-resolution by leveraging joint content-speaker disentanglement, pitch-aware fusion, and diffusion-based refinement.
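The summary mentions chunked inference but does not describe how chunks are formed or joined. The following is a minimal sketch of one common approach, not the authors' implementation: it splits a frame-level feature sequence into overlapping chunks, runs a conversion function on each, and cross-fades the overlaps. The names `split_into_chunks`, `convert_in_chunks`, and `convert_fn`, as well as the chunk length and overlap values, are hypothetical and chosen only for illustration.

```python
import numpy as np


def split_into_chunks(num_frames, chunk_len, overlap):
    """Return (start, end) frame ranges covering [0, num_frames) with a fixed overlap."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    hop = chunk_len - overlap
    ranges = []
    start = 0
    while True:
        end = min(start + chunk_len, num_frames)
        ranges.append((start, end))
        if end == num_frames:
            break
        start += hop
    return ranges


def convert_in_chunks(features, convert_fn, chunk_len=256, overlap=32):
    """Apply convert_fn chunk by chunk and cross-fade the overlapping frames.

    features:   array of shape (num_frames, dim)
    convert_fn: hypothetical stand-in for the model; maps (n, dim) -> (n, out_dim)
    """
    num_frames = features.shape[0]
    out = None
    weight = np.zeros(num_frames)
    for start, end in split_into_chunks(num_frames, chunk_len, overlap):
        chunk_out = convert_fn(features[start:end])
        n = end - start
        w = np.ones(n)
        ramp = min(overlap, n)
        # Linear ramps that never reach exactly 0, so the summed weight stays positive.
        if start > 0 and ramp > 0:
            w[:ramp] = np.linspace(0.0, 1.0, ramp + 2)[1:-1]
        if end < num_frames and ramp > 0:
            w[-ramp:] = np.linspace(1.0, 0.0, ramp + 2)[1:-1]
        if out is None:
            out = np.zeros((num_frames, chunk_out.shape[1]))
        out[start:end] += chunk_out * w[:, None]
        weight[start:end] += w
    # Normalize by the accumulated weights to complete the cross-fade.
    return out / weight[:, None]


if __name__ == "__main__":
    feats = np.random.default_rng(0).normal(size=(1000, 8))
    converted = convert_in_chunks(feats, lambda x: x)  # identity stand-in for the model
    print(converted.shape)
```

With an identity `convert_fn`, the cross-fade reproduces the input, which is a quick sanity check on the windowing. A real model would produce slightly different outputs for the same frame in neighboring chunks, and the cross-fade only smooths that mismatch; whether HQ-SVC chunks features, waveforms, or something else is not stated in the summary.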

Abstract

Zero-shot singing voice conversion (SVC) transforms a source singer's timbre to an unseen target speaker's voice while preserving melodic content, without fine-tuning. Existing methods model speaker timbre and vocal content separately, which discards essential acoustic information and degrades output quality, while also requiring significant computational resources. To overcome these limitations, we propose HQ-SVC, an efficient framework for high-quality zero-shot SVC. HQ-SVC first jointly extracts content and speaker features using a decoupled codec. It then enhances fidelity through pitch and volume modeling, preserving critical acoustic information typically lost in separate modeling approaches, and progressively refines outputs via differentiable signal processing and diffusion techniques. Evaluations confirm that HQ-SVC significantly outperforms state-of-the-art zero-shot SVC methods in conversion quality and efficiency. Beyond voice conversion, HQ-SVC natively supports voice super-resolution and achieves superior voice naturalness compared to specialized audio super-resolution methods.
