kNN-SVC: Robust Zero-Shot Singing Voice Conversion with Additive Synthesis and Concatenation Smoothness Optimization

Keren Shao, Ke Chen, Matthew Baas, Shlomo Dubnov

TL;DR

Zero-shot singing voice conversion often suffers from dull timbre, caused by insufficient harmonic emphasis, and from poor temporal coherence, caused by frame-wise candidate selection. The paper introduces kNN-SVC, which addresses both. For timbre, it adds harmonic content via additive synthesis, constructing $U(t)=\sum_{n=1}^N A_n(t)\sin(\mathrm{cumsum}(2\pi n f_0(t)))$ and injecting it into a HiFi-GAN vocoder. For temporal coherence, it uses the distance $L_{\mathrm{total}}(C,t)=\mathrm{cosine\_sim}(C,S_t)+m\,\mathrm{median}(\{\mathrm{cosine\_sim}(C,C')\,|\,C'\in\mathcal{A}_{t-1}\})$, followed by autoregressive reselection and weighted concatenation. Empirical results on LibriSpeech, OpenSinger, and NUS48E show improvements in EER, MOS, and SIM with minimal WER/CER changes, outperforming kNN-VC and NeuCoSVC in zero-shot tasks. The approach is non-parametric and broadly applicable to concatenative neural synthesis; code and a demo are publicly released.
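The additive-synthesis formula for $U(t)$ can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that $f_0$ is given per sample in Hz and therefore divides by a `sample_rate` parameter, which the formula leaves implicit. It also assumes that unvoiced samples are marked by $f_0 = 0$. Muting harmonics above the Nyquist frequency is an added precaution, not something the summary states. The function and argument names are hypothetical.

```python
import numpy as np


def additive_synthesis(f0, amps, sample_rate):
    """Sketch of U(t) = sum_n A_n(t) * sin(cumsum(2*pi*n*f0(t) / sample_rate)).

    f0:   shape (T,), per-sample fundamental frequency in Hz (assumed 0 when unvoiced).
    amps: shape (N, T), per-sample amplitudes A_n(t) of harmonics n = 1..N.
    """
    f0 = np.asarray(f0, dtype=np.float64)
    amps = np.asarray(amps, dtype=np.float64)
    harmonic_numbers = np.arange(1, amps.shape[0] + 1)[:, None]  # shape (N, 1)

    # Running phase of the fundamental. By linearity of cumsum, the phase of
    # harmonic n is n times this value.
    phase = np.cumsum(2.0 * np.pi * f0 / sample_rate)  # shape (T,)

    # Assumption (not from the source): silence unvoiced samples and any
    # harmonic at or above the Nyquist frequency to avoid aliasing.
    harmonic_freqs = harmonic_numbers * f0[None, :]
    audible = (f0[None, :] > 0) & (harmonic_freqs < sample_rate / 2)

    return np.sum(amps * audible * np.sin(harmonic_numbers * phase[None, :]), axis=0)
```

With a constant 100 Hz contour and a single unit-amplitude harmonic, the output reduces to a plain 100 Hz sine wave, which is a quick sanity check on the phase accumulation.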

Abstract

Robustness is critical in zero-shot singing voice conversion (SVC). This paper introduces two novel methods to strengthen the robustness of the kNN-VC framework for SVC. First, kNN-VC's core representation, WavLM, lacks harmonic emphasis, resulting in dull sounds and ringing artifacts. To address this, we leverage the bijection between WavLM, pitch contours, and spectrograms to perform additive synthesis, integrating the resulting waveform into the model to mitigate these issues. Second, kNN-VC overlooks concatenative smoothness, a key perceptual factor in SVC. To enhance smoothness, we propose a new distance metric that filters out unsuitable kNN candidates and optimize the summing weights of the candidates during inference. Although our techniques are built on the kNN-VC framework for implementation convenience, they are broadly applicable to general concatenative neural synthesis models. Experimental results validate the effectiveness of these modifications in achieving robust SVC. Demo: http://knnsvc.com Code: https://github.com/SmoothKen/knn-svc
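The candidate-filtering idea can be sketched in a few lines of NumPy, using the $L_{\mathrm{total}}$ expression from the TL;DR. This is a rough illustration under several assumptions, not the released code. It treats the score as a similarity to maximize; if the actual implementation works with cosine distances, the sort order would flip. It takes $\mathcal{A}_{t-1}$ to be the candidate set kept at the previous frame. The pool size, `k`, and `m` defaults are arbitrary placeholders, and all function names are hypothetical.

```python
import numpy as np


def cosine_sim_matrix(a, b, eps=1e-12):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a @ b.T


def reselect_candidates(source, reference, k=4, pool_size=8, m=1.0):
    """Autoregressively pick k reference-frame indices per source frame.

    source:    shape (T, D), source feature frames S_t.
    reference: shape (R, D), reference feature frames (the candidates C).
    Returns an integer array of shape (T, k).
    """
    src_sim = cosine_sim_matrix(source, reference)     # shape (T, R)
    ref_sim = cosine_sim_matrix(reference, reference)  # shape (R, R)
    selected = []
    for t in range(source.shape[0]):
        # Initial pool: the reference frames most similar to the source frame.
        pool = np.argsort(-src_sim[t])[:pool_size]
        score = src_sim[t, pool].copy()
        if selected:
            # Concatenation term: median similarity between each pool member
            # and the candidates kept at the previous frame (A_{t-1}).
            prev = selected[-1]
            score += m * np.median(ref_sim[np.ix_(pool, prev)], axis=1)
        selected.append(pool[np.argsort(-score)[:k]])
    return np.stack(selected)
```

Setting `m=0` recovers plain frame-wise top-k selection, so the effect of the concatenation term can be inspected by comparing the two outputs on the same input.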

Paper Structure

This paper contains 8 sections, 3 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Workflow of the kNN-SVC model: The left blue block represents Concatenation Smoothness Optimization, while the Additive Synthesis pipeline occupies the blue column on the right. The original kNN-VC backbone is composed of the remaining red blocks.
  • Figure 2: The process of creating additively synthesized waveform. We first extract the corresponding harmonic amplitude vector for each reference WavLM frame (top). With the orange star indicating the target pitch, we then select candidate frames with the closest pitches to perform additive synthesis (bottom).
  • Figure 3: The problem that inference-time Concatenation Smoothness Optimization attempts to address. The numbers in the matrix represent candidate indices from the reference utterance. The presence of short, mutually exclusive territories in the matrix is the primary cause of the trembling artifacts seen in the waveform above.
  • Figure 4: The process of inference-time Concatenation Smoothness Optimization. The numbers represent candidate indices from the reference utterance. Top: Autoregressively reselecting candidates based on a weighted sum of cosine similarity $L_{src}$ and concatenative cost $L_{concat}$. Bottom: Optimizing the summing weights towards minimizing the discrepancy between one's concatenation neighbors and its ideal continuations in the reference utterance.
  • Figure 5: Left: Spectrogram of the output without Additive Synthesis, showing a dull voice quality (due to the absence of high-frequency harmonics) and noticeable ringing artifacts (horizontal dark-blue lines between harmonics). Right: Spectrogram of the output with Additive Synthesis, demonstrating improved harmonic richness and a cleaner signal.
  • ...and 1 more figure