End-to-End Zero-Shot Voice Conversion with Location-Variable Convolutions

Wonjune Kang, Mark Hasegawa-Johnson, Deb Roy

TL;DR

This work tackles zero-shot voice conversion with an end-to-end approach that eliminates the need for a separate vocoder. It introduces LVC-VC, a neural-vocoder–like model that employs location-variable convolutions conditioned on disentangled content and speaker features to jointly convert voice and synthesize time-domain audio. The method leverages kernel-predictor networks for LVC layers, self-reconstruction training with frequency-domain warping, Gaussian speaker embeddings, and an SSC loss, yielding robust generalization to unseen speakers while maintaining speech intelligibility. On VCTK, LVC-VC achieves a favorable balance between voice style transfer quality and intelligibility, with competitive MOS and the lowest CER, demonstrating the practicality of end-to-end zero-shot VC with a compact model footprint.

Abstract

Zero-shot voice conversion is becoming an increasingly popular research topic, as it promises the ability to transform speech to sound like any speaker. However, relatively little work has been done on end-to-end methods for this task, which are appealing because they remove the need for a separate vocoder to generate audio from intermediate features. In this work, we propose LVC-VC, an end-to-end zero-shot voice conversion model that uses location-variable convolutions (LVCs) to jointly model the conversion and speech synthesis processes. LVC-VC takes carefully designed input features in which content and speaker information are disentangled, and it uses a neural vocoder-like architecture whose LVCs efficiently combine these features to perform voice conversion while directly synthesizing time-domain audio. Experiments show that our model achieves especially well-balanced performance between voice style transfer and speech intelligibility compared to several baselines.

Paper Structure (15 sections, 5 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: The components of the overall LVC-VC architecture. Content and speaker features are fed into the kernel predictors, which output kernels for the LVC layers in the generator. Each kernel predictor outputs the kernels for all four LVC blocks in a given transposed convolutional stack (shown in red, yellow, green, and blue at the right of (a) and top of (b)). $\mathbin{+\mkern-10mu+}$ denotes stacking/concatenation in (b).
  • Figure 2: Results of computing STFTs on outputs of the (a) first, (b) second, and (c) third transposed convolutional stacks. For brevity, we show only 4 of the 16 channels.
  • Figure 3: From left to right: spectrograms of the original utterance, audio generated when zeroing out speaker features, and audio generated when zeroing out content features.
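
The mechanism described in the Figure 1 caption, where kernel predictors emit the kernels used by the LVC layers, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the kernel predictor here is a single linear map (a stand-in for the paper's kernel-predictor network), and the names `predict_kernels`, `location_variable_conv`, `hop`, and all shapes are assumptions chosen for illustration. The only idea it aims to show is that each conditioning frame yields its own convolution kernel, which is applied to the matching segment of the input sequence.

```python
import numpy as np

def predict_kernels(cond, weight, c_out, c_in, k):
    """Hypothetical kernel predictor: one linear map per conditioning frame.

    cond:   (L, D) conditioning features, one row per frame
    weight: (D, c_out * c_in * k) projection matrix
    returns (L, c_out, c_in, k), i.e. one kernel per frame
    """
    n_frames = cond.shape[0]
    return (cond @ weight).reshape(n_frames, c_out, c_in, k)

def location_variable_conv(x, kernels, hop):
    """Apply a different kernel to each hop-length segment of x.

    x:       (c_in, T) input sequence with T == L * hop
    kernels: (L, c_out, c_in, k) per-segment kernels, k odd
    returns  (c_out, T)
    """
    n_frames, c_out, c_in, k = kernels.shape
    assert k % 2 == 1, "odd kernel size keeps the output length equal to T"
    assert x.shape == (c_in, n_frames * hop)
    pad = k // 2
    # Zero-pad the ends so every segment can see k // 2 samples of context.
    x_pad = np.pad(x, ((0, 0), (pad, pad)))
    y = np.zeros((c_out, n_frames * hop))
    for l in range(n_frames):
        # Segment l plus its left/right context: (c_in, hop + k - 1)
        seg = x_pad[:, l * hop : (l + 1) * hop + 2 * pad]
        # All length-k windows of the segment: (c_in, hop, k)
        windows = np.lib.stride_tricks.sliding_window_view(seg, k, axis=1)
        # Cross-correlate with this segment's own kernel: (c_out, hop)
        y[:, l * hop : (l + 1) * hop] = np.einsum("oik,itk->ot", kernels[l], windows)
    return y

rng = np.random.default_rng(0)
cond = rng.standard_normal((4, 8))                  # 4 frames, 8-dim features
weight = rng.standard_normal((8, 2 * 3 * 5)) * 0.1  # illustrative projection
kernels = predict_kernels(cond, weight, c_out=2, c_in=3, k=5)
x = rng.standard_normal((3, 4 * 16))                # 3 channels, 64 samples
y = location_variable_conv(x, kernels, hop=16)
print(y.shape)  # (2, 64)
```

Because the kernels depend on the conditioning frames, changing the conditioning features changes the filtering applied to each segment of the sequence, whereas an ordinary convolution would reuse one fixed kernel at every position.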