EZ-VC: Easy Zero-shot Any-to-Any Voice Conversion
Advait Joglekar, Divyanshu Singh, Rooshil Rohit Bhatia, S. Umesh
TL;DR
The paper tackles zero-shot voice conversion (VC), with a focus on cross-lingual generalization to unseen languages and accents. It introduces EZ-VC, a textless, self-supervised architecture that pairs discrete speech units from a multilingual SSL encoder with a non-autoregressive, diffusion-transformer-based flow-matching decoder. The model is trained without labeled data and without multiple encoders. Key contributions include (i) a speech-to-units stage that quantizes 14th-layer features of the Xeus encoder into 500 clusters, (ii) a units-to-speech stage based on F5-TTS, trained with an infilling objective, and (iii) a comprehensive evaluation reporting state-of-the-art naturalness and speaker similarity across seen and unseen languages. The approach offers a simpler, faster VC pipeline with strong cross-lingual generalization, though it depends on the quality of the encoder and has high compute requirements.
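The speech-to-units step can be illustrated with a minimal sketch, assuming the 500 clusters are represented by centroids (as in k-means) and that each frame is assigned to its nearest centroid. The feature dimension, the random stand-in features, the random codebook, and the function name `quantize_to_units` are all hypothetical; in the actual system the features would come from the 14th layer of Xeus and the codebook from a trained clustering model.

```python
import numpy as np

def quantize_to_units(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each frame (row of `features`) to the index of its nearest centroid."""
    # Squared Euclidean distance, expanded as ||x||^2 - 2 x.c + ||c||^2
    dists = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
NUM_UNITS = 500  # cluster count stated in the text
FEAT_DIM = 16    # hypothetical; real SSL features are much wider

# Stand-ins (assumptions): a random codebook and random frame features.
centroids = rng.normal(size=(NUM_UNITS, FEAT_DIM))
frames = rng.normal(size=(40, FEAT_DIM))

units = quantize_to_units(frames, centroids)  # one integer unit per frame
```

Each utterance then becomes a sequence of integers in the range 0 to 499, which is the only content representation passed on to the decoder in this sketch.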
Abstract
Voice Conversion research in recent times has increasingly focused on improving the zero-shot capabilities of existing methods. Despite remarkable advancements, current architectures still tend to struggle in zero-shot cross-lingual settings. They are also often unable to generalize to speakers of unseen languages and accents. In this paper, we adopt a simple yet effective approach that combines discrete speech representations from self-supervised models with a non-autoregressive Diffusion-Transformer-based conditional flow matching speech decoder. We show that this architecture allows us to train a voice-conversion model in a purely textless, self-supervised fashion. Our technique works without requiring multiple encoders to disentangle speech features. Our model also manages to excel in zero-shot cross-lingual settings even for unseen languages. Demo: https://ez-vc.github.io/EZ-VC-Demo/
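As a rough illustration of how a conditional flow matching decoder can be trained with an infilling objective, the sketch below computes one loss sample. It assumes the straight-line probability path commonly used in flow matching, and it scores only the masked frames. The function names, the zero-filled context, and the toy shapes are assumptions for illustration and are not taken from the paper; a real decoder would also receive the discrete speech units as conditioning, which is omitted here for brevity.

```python
import numpy as np

def cfm_infilling_loss(predict_velocity, mel, mask, rng):
    """One Monte Carlo sample of a masked conditional flow matching loss.

    mel:  (frames, bins) target mel spectrogram, the data point x1
    mask: (frames,) bool, True where frames are hidden and must be infilled
    """
    x1 = mel
    x0 = rng.normal(size=x1.shape)              # noise sample
    t = rng.uniform()                           # flow time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1               # point on the straight path
    target = x1 - x0                            # velocity along that path
    context = np.where(mask[:, None], 0.0, x1)  # only unmasked frames are visible
    pred = predict_velocity(x_t, context, t)    # hypothetical model call
    per_frame = ((pred - target) ** 2).mean(axis=1)
    return per_frame[mask].mean()               # score masked frames only

rng = np.random.default_rng(0)
mel = rng.normal(size=(20, 8))                  # toy spectrogram (assumption)
mask = np.zeros(20, dtype=bool)
mask[5:15] = True                               # hide a contiguous span

# A placeholder "model" that always predicts zero velocity.
loss_zero = cfm_infilling_loss(lambda x_t, ctx, t: np.zeros_like(x_t), mel, mask, rng)
```

At inference time, a model trained this way would be given the reference speaker's audio as the visible context and asked to fill in the masked region, which is how an infilling objective can support zero-shot conversion without a separate speaker encoder.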
