Generating Novel and Realistic Speakers for Voice Conversion

Meiying Melissa Chen, Zhenyu Wang, Zhiyao Duan

TL;DR

This paper addresses a limitation of voice-conversion (VC) systems, their reliance on target utterances, by introducing SpeakerVAE, a lightweight plug-in method that generates novel speaker timbres with NVAE, a deep hierarchical variational autoencoder. By adapting NVAE to 1D speaker embeddings and using it to model the embedding space, the approach enables sampling of unseen yet natural-sounding speakers without retraining the base VC model. SpeakerVAE is trained on real speaker embeddings and integrated with FACodec and CosyVoice2, where it yields high-fidelity, diverse timbres that preserve intelligibility (low WER/CER) and perceptual naturalness (UTMOSv2) while expanding the range of usable voices. The results demonstrate the method’s efficiency and its compatibility with multiple VC backends, and point to controlled generation as future work.

Abstract

Voice conversion (VC) models modify timbre while preserving paralinguistic features, enabling applications like dubbing and identity protection. However, most VC systems require access to target utterances, limiting their use when target data is unavailable or when users desire conversion to entirely novel, unseen voices. To address this, we introduce SpeakerVAE, a lightweight method for generating novel speakers for VC. Our approach uses a deep hierarchical variational autoencoder to model the speaker timbre space. By sampling from the trained model, we generate novel speaker representations for voice synthesis in a VC pipeline. The proposed method is a flexible plug-in module compatible with various VC models and requires no co-training or fine-tuning of the base VC system. We evaluated our approach with two state-of-the-art VC models, FACodec and CosyVoice2. The results demonstrate that our method successfully generates novel, unseen speakers with quality comparable to that of the training speakers.
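The sampling step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the two-level latent hierarchy, the layer sizes, the randomly initialized weights standing in for a trained model, the final L2 normalization, and the function name `sample_speaker_embeddings` are all assumptions made for illustration. It only shows the general shape of ancestral sampling from a hierarchical VAE prior followed by decoding into a 1D speaker embedding.

```python
import numpy as np


def sample_speaker_embeddings(n, dim=256, latent_dims=(16, 32), seed=0):
    """Toy ancestral sampling from a two-level hierarchical prior.

    All weights below are random placeholders (hypothetical), standing in
    for the parameters a trained hierarchical VAE would provide.
    """
    rng = np.random.default_rng(seed)
    d_top, d_low = latent_dims

    # Placeholder "trained" weights: a conditional prior and a linear decoder.
    w_prior = rng.normal(scale=0.1, size=(d_top, 2 * d_low))
    w_dec = rng.normal(scale=0.1, size=(d_top + d_low, dim))

    # Top level: z1 ~ N(0, I).
    z1 = rng.standard_normal((n, d_top))

    # Lower level: z2 ~ N(mu(z1), sigma(z1)^2), conditioned on z1.
    stats = np.tanh(z1 @ w_prior)
    mu, log_sigma = stats[:, :d_low], stats[:, d_low:]
    z2 = mu + np.exp(log_sigma) * rng.standard_normal((n, d_low))

    # Decode both latent groups into a 1D speaker embedding.
    emb = np.concatenate([z1, z2], axis=1) @ w_dec

    # Assumption: the downstream VC model expects unit-norm embeddings.
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)


if __name__ == "__main__":
    novel = sample_speaker_embeddings(4)
    print(novel.shape)
```

In a plug-in setting like the one the abstract describes, each sampled vector would be passed to the base VC model in place of an embedding extracted from a target utterance, with the VC model itself left unchanged.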

Paper Structure

This paper contains 18 sections, 4 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: SpeakerVAE overview. Left side shows the architecture and training process. Right side depicts the inference pipeline.
  • Figure 2: Audio quality metrics.
  • Figure 3: UMAP visualization of the training and generated embeddings.