End-to-End Zero-Shot Voice Conversion with Location-Variable Convolutions
Wonjune Kang, Mark Hasegawa-Johnson, Deb Roy
TL;DR
This work tackles zero-shot voice conversion with an end-to-end approach that eliminates the need for a separate vocoder. It introduces LVC-VC, a neural vocoder-like model that employs location-variable convolutions (LVCs) conditioned on disentangled content and speaker features to jointly convert voice and synthesize time-domain audio. The method combines kernel-predictor networks for the LVC layers, self-reconstruction training with frequency-domain warping, Gaussian speaker embeddings, and an SSC loss, yielding robust generalization to unseen speakers while maintaining speech intelligibility. On VCTK, LVC-VC achieves a favorable balance between voice style transfer quality and intelligibility, with competitive mean opinion scores (MOS) and the lowest character error rate (CER), demonstrating the practicality of end-to-end zero-shot VC with a compact model footprint.
Abstract
Zero-shot voice conversion is becoming an increasingly popular research topic, as it promises the ability to transform speech to sound like any speaker. However, relatively little work has been done on end-to-end methods for this task, which are appealing because they remove the need for a separate vocoder to generate audio from intermediate features. In this work, we propose LVC-VC, an end-to-end zero-shot voice conversion model that uses location-variable convolutions (LVCs) to jointly model the conversion and speech synthesis processes. LVC-VC takes carefully designed input features in which content and speaker information are disentangled, and its neural vocoder-like architecture uses LVCs to combine them efficiently, performing voice conversion while directly synthesizing time-domain audio. Experiments show that our model achieves especially well-balanced performance between voice style transfer and speech intelligibility compared to several baselines.
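To make the idea of a location-variable convolution concrete, the following is a minimal NumPy sketch, not the LVC-VC implementation. It assumes that each conditioning frame covers `hop` consecutive samples and that a kernel predictor emits one convolution kernel per frame, so different time segments of the signal are filtered with different kernels. The function names, the linear `predict_kernels` stand-in, and all shapes are hypothetical choices made for illustration; the actual model uses learned kernel-predictor networks and additional layers not shown here.

```python
import numpy as np


def predict_kernels(cond, c_in, c_out, k, rng):
    """Hypothetical stand-in for a kernel predictor network.

    Maps each conditioning frame (a column of `cond`, shape (d, n_frames))
    to its own kernel via a random linear projection.
    Returns an array of shape (n_frames, c_out, c_in, k).
    """
    d, n_frames = cond.shape
    w = rng.standard_normal((c_out * c_in * k, d)) * 0.1
    return (w @ cond).T.reshape(n_frames, c_out, c_in, k)


def location_variable_conv(x, kernels, hop):
    """Apply a different 1D kernel to each length-`hop` segment of `x`.

    x:       (c_in, t) signal, with t == n_frames * hop
    kernels: (n_frames, c_out, c_in, k), k odd
    Returns: (c_out, t)
    """
    c_in, t = x.shape
    n_frames, c_out, _, k = kernels.shape
    assert k % 2 == 1, "this sketch assumes an odd kernel size"
    assert t == n_frames * hop, "signal length must equal n_frames * hop"

    pad = k // 2
    x_pad = np.pad(x, ((0, 0), (pad, pad)))  # zero-pad the time axis
    y = np.zeros((c_out, t))
    for i in range(n_frames):
        # Segment i plus the context needed by a size-k kernel.
        seg = x_pad[:, i * hop : (i + 1) * hop + 2 * pad]
        # windows has shape (c_in, hop, k): one size-k window per output sample.
        windows = np.lib.stride_tricks.sliding_window_view(seg, k, axis=1)
        # Cross-correlate with the kernel predicted for this segment only.
        y[:, i * hop : (i + 1) * hop] = np.einsum("oik,ihk->oh", kernels[i], windows)
    return y


rng = np.random.default_rng(0)
cond = rng.standard_normal((5, 4))   # 4 conditioning frames, 5 features each
x = rng.standard_normal((2, 32))     # 2 input channels, 4 frames * hop of 8
kernels = predict_kernels(cond, c_in=2, c_out=4, k=3, rng=rng)
y = location_variable_conv(x, kernels, hop=8)
print(y.shape)  # (4, 32)
```

In this sketch the conditioning matrix plays the role that the content and speaker features play in the description above: changing a frame's conditioning changes the kernel applied to the corresponding stretch of audio, while an ordinary convolution would reuse one kernel everywhere.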
