Zero-shot Voice Conversion with Diffusion Transformers

Songting Liu

TL;DR

Seed-VC tackles zero-shot voice conversion by addressing timbre leakage, timbre representation, and training–inference misalignment. It introduces an external timbre shifter during training and employs a diffusion transformer that uses the full reference context to capture fine-grained timbre features. The framework achieves state-of-the-art performance on zero-shot spoken VC against strong baselines and extends to zero-shot singing VC with F0 conditioning, maintaining high speaker similarity and intelligibility. Ablation and qualitative analyses highlight the benefits of timbre shifting, full-reference context, and robust timbre representation for generalization to unseen speakers and singing voices.

Abstract

Zero-shot voice conversion aims to transform a source speech utterance to match the timbre of reference speech from an unseen speaker. Traditional approaches struggle with timbre leakage, insufficient timbre representation, and mismatches between training and inference tasks. We propose Seed-VC, a novel framework that addresses these issues by introducing an external timbre shifter during training to perturb the source speech timbre, mitigating leakage and aligning training with inference. Additionally, we employ a diffusion transformer that leverages the entire reference speech context, capturing fine-grained timbre features through in-context learning. Experiments demonstrate that Seed-VC outperforms strong baselines like OpenVoice and CosyVoice, achieving higher speaker similarity and lower word error rates in zero-shot voice conversion tasks. We further extend our approach to zero-shot singing voice conversion by incorporating fundamental frequency (F0) conditioning, achieving performance comparable to current state-of-the-art methods. Our findings highlight the effectiveness of Seed-VC in overcoming core challenges, paving the way for more accurate and versatile voice conversion systems.

Paper Structure

This paper contains 46 sections, 6 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Architectural details of the U-Net-style skip connections and timestep conditioning.
  • Figure 2: Training pipeline. A random segment is selected as the timbre prompt. The prompt component contains semantic and acoustic features from the original audio, while the target component contains semantic features from the timbre-shifted audio. The loss is calculated only on the target component.
  • Figure 3: Inference pipeline. It mirrors the training pipeline, with the reference audio serving as the timbre enrollment.
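
The training setup described for Figure 2 can be illustrated with a small sketch. This is a toy illustration under stated assumptions, not the authors' implementation. Each frame carries a single scalar feature. The prompt is assumed to be a random contiguous segment. The acoustic slot of target frames is assumed to be zero-filled. Plain squared error stands in for whatever diffusion objective the paper actually uses. The function names `choose_prompt_mask`, `build_condition`, and `masked_mse` are hypothetical.

```python
import random


def choose_prompt_mask(num_frames, prompt_len, rng):
    """Mark a random contiguous segment as the timbre prompt (True = prompt frame).

    Assumption: the caption only says "a random segment"; contiguity is our choice.
    """
    if not 0 < prompt_len < num_frames:
        raise ValueError("prompt_len must be in (0, num_frames)")
    start = rng.randint(0, num_frames - prompt_len)  # inclusive on both ends
    return [start <= i < start + prompt_len for i in range(num_frames)]


def build_condition(semantic_orig, semantic_shifted, acoustic_orig, prompt_mask):
    """Assemble per-frame (semantic, acoustic) conditioning pairs.

    Prompt frames: semantic and acoustic features from the original audio.
    Target frames: semantic features from the timbre-shifted audio; the
    acoustic slot is zero-filled (an assumption made for this sketch).
    """
    cond = []
    for i, is_prompt in enumerate(prompt_mask):
        if is_prompt:
            cond.append((semantic_orig[i], acoustic_orig[i]))
        else:
            cond.append((semantic_shifted[i], 0.0))
    return cond


def masked_mse(pred, target, prompt_mask):
    """Mean squared error over target (non-prompt) frames only."""
    errs = [(p - t) ** 2 for p, t, m in zip(pred, target, prompt_mask) if not m]
    return sum(errs) / len(errs)


if __name__ == "__main__":
    rng = random.Random(0)
    mask = choose_prompt_mask(num_frames=10, prompt_len=4, rng=rng)
    cond = build_condition([1.0] * 10, [2.0] * 10, [3.0] * 10, mask)
    # Errors on prompt frames are ignored, however large they are.
    target = [100.0 if m else 1.0 for m in mask]
    print(masked_mse([0.0] * 10, target, mask))
```

Because prompt frames are excluded from the loss, the model is free to read acoustic detail from the prompt while being trained only to reconstruct the frames whose semantic input came from timbre-shifted audio. That matches the role the caption gives to the prompt and target components.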