
Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion

Binzhu Sha, Xu Li, Zhiyong Wu, Ying Shan, Helen Meng

TL;DR

NeuCoSVC tackles timbre leakage in any-to-any SVC by reusing SSL representations from the reference audio directly and by modeling pitch explicitly with neural harmonic signals. During inference, each source SSL feature is replaced using its $k=4$ nearest neighbors among the reference SSL features, while a neural harmonic generator builds harmonics up to $K \approx f_s/(2 f_0[n])$ from the source pitch; both feed a FiLM-conditioned synthesizer. Evaluated on OpenSinger and NUS-48E, the approach yields better naturalness and voice similarity than a disentanglement-based baseline in intra-language, cross-language, and cross-domain settings, and a duration study shows diminishing returns beyond roughly 60 seconds of reference audio. The results support a practical, high-quality one-shot SVC framework, and the authors release code and samples for public use.
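The harmonic bound $K \approx f_s/(2 f_0[n])$ can be illustrated with a small sketch. It is only a toy additive synthesizer written under assumptions: the function names `num_harmonics` and `harmonic_signal`, the unit amplitude per harmonic, and the convention that unvoiced samples have $f_0 = 0$ are all hypothetical, and the linear time-varying filtering mentioned in the abstract is not reproduced here.

```python
import math

def num_harmonics(f0, fs):
    """K = floor(fs / (2 * f0)): harmonics of f0 that do not exceed Nyquist."""
    return int(fs / (2 * f0))

def harmonic_signal(f0_track, fs):
    """Sum of unit-amplitude sinusoids at integer multiples of a per-sample f0.

    Assumption: f0_track holds one f0 value (Hz) per output sample, and
    f0 <= 0 marks an unvoiced sample that produces silence.
    """
    out, phase = [], 0.0
    for f0 in f0_track:
        if f0 <= 0:
            out.append(0.0)
            continue
        # Accumulate the phase of the fundamental so that pitch changes stay continuous.
        phase += 2 * math.pi * f0 / fs
        k_max = num_harmonics(f0, fs)
        out.append(sum(math.sin(k * phase) for k in range(1, k_max + 1)))
    return out

# Toy usage: 10 ms of a 200 Hz tone at a hypothetical 16 kHz sample rate.
signal = harmonic_signal([200.0] * 160, 16000)
```

Because $K$ is recomputed per sample, higher notes automatically use fewer harmonics, which keeps every partial at or below the Nyquist frequency.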

Abstract

Any-to-any singing voice conversion (SVC) faces the challenge of ``timbre leakage'', which is caused by inadequate disentanglement between the content and the speaker timbre. To address this issue, this study introduces NeuCoSVC, a novel neural concatenative SVC framework. It consists of a self-supervised learning (SSL) representation extractor, a neural harmonic signal generator, and a waveform synthesizer. The SSL extractor condenses audio into fixed-dimensional SSL features, while the harmonic signal generator leverages linear time-varying filters to produce both raw and filtered harmonic signals that carry pitch information. The synthesizer reconstructs waveforms from SSL features, harmonic signals, and loudness information. During inference, voice conversion is performed by substituting source SSL features with their nearest counterparts from a matching pool, which comprises SSL features extracted from the reference audio, while preserving the raw harmonic signals and loudness of the source audio. By directly utilizing SSL features from the reference audio, the proposed framework effectively resolves the ``timbre leakage'' issue found in previous disentanglement-based approaches. Experimental results demonstrate that the proposed NeuCoSVC system outperforms the disentanglement-based speaker embedding approach in one-shot SVC across intra-language, cross-language, and cross-domain evaluations.
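The matching step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: cosine similarity as the distance measure and averaging of the selected neighbors are assumptions, the names `cosine` and `knn_match` are hypothetical, and the default `k=4` is taken from the TL;DR. Features are plain lists of floats standing in for SSL frames.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def knn_match(source, pool, k=4):
    """Replace each source frame with the mean of its k most similar pool frames.

    source: list of frames extracted from the source audio
    pool:   list of frames extracted from the reference audio (the matching pool)
    """
    converted = []
    for frame in source:
        # Rank reference frames by similarity to this source frame (assumed metric).
        nearest = sorted(pool, key=lambda p: cosine(frame, p), reverse=True)[:k]
        # Average the selected frames dimension by dimension (assumed aggregation).
        converted.append([sum(dim) / len(nearest) for dim in zip(*nearest)])
    return converted

# Toy usage with 2-dimensional stand-ins for SSL features.
pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
source = [[2.0, 0.1]]
matched = knn_match(source, pool, k=2)
```

Every output frame is built only from reference frames, which is why no source timbre can pass through the content features; pitch and loudness are supplied separately from the source.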

Paper Structure (16 sections, 3 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: The structure of the proposed SVC system: (a) the SSL feature extracting and matching module; (b) the neural harmonic signal generator; (c) the audio synthesizer.
  • Figure 2: Experimental results of the duration study. MOS results are calculated with 95% confidence intervals.