VoicePrompter: Robust Zero-Shot Voice Conversion with Voice Prompt and Conditional Flow Matching

Ha-Yeong Choi, Jaehan Park

TL;DR

VoicePrompter tackles zero-shot voice conversion by enabling in-context learning through explicit voice prompts and a DiT-based conditional flow matching (CFM) framework. A speech factorizing encoder splits speech into content, pitch, and speaker factors, and a DiT-based CFM decoder is conditioned on these factors plus voice prompts, aided by adaLN-Sep and latent mixup. Empirical results on LibriTTS and VCTK demonstrate improved speaker similarity, intelligibility, and audio quality over strong baselines, with effective one-pass generation and robustness to mismatched training-inference conditions. The work highlights the potential of prompting and factorized conditioning to advance practical, high-fidelity zero-shot VC, especially when paired with scalable backbone models.

Abstract

Despite remarkable advancements in recent voice conversion (VC) systems, enhancing speaker similarity in zero-shot scenarios remains challenging. This challenge arises from the difficulty of generalizing and adapting speaker characteristics in speech within zero-shot environments, which is further complicated by the mismatch between the training and inference processes. To address these challenges, we propose VoicePrompter, a robust zero-shot VC model that leverages in-context learning with voice prompts. VoicePrompter is composed of (1) a factorization method that disentangles speech components and (2) a DiT-based conditional flow matching (CFM) decoder that conditions on these factorized features and voice prompts. Additionally, (3) latent mixup is used to enhance in-context learning by combining various speaker features. This approach improves speaker similarity and naturalness in zero-shot VC by applying mixup to latent representations. Experimental results demonstrate that VoicePrompter outperforms existing zero-shot VC systems in terms of speaker similarity, speech intelligibility, and audio quality. Our demo is available at https://hayeong0.github.io/VoicePrompter-demo/.
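
To make the two training-side ideas named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the commonly used optimal-transport CFM path (noise interpolated linearly toward the target, with a small `sigma_min`) and a standard Beta-sampled convex combination for mixup; the abstract does not give either formula, so both are assumptions. All function names and parameters (`cfm_training_pair`, `cfm_loss`, `latent_mixup`, `sigma_min`, `alpha`) are hypothetical, and plain Python lists stand in for feature tensors to keep the example dependency-free.

```python
import random


def cfm_training_pair(x1, sigma_min=1e-4, rng=random):
    """Build one CFM training example for a target feature vector x1.

    Assumed optimal-transport path (not confirmed by the abstract):
        x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
        u_t = x1 - (1 - sigma_min) * x0
    where x0 is Gaussian noise and t is uniform in [0, 1).
    """
    t = rng.random()
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * n + t * d for n, d in zip(x0, x1)]
    u_t = [d - (1.0 - sigma_min) * n for n, d in zip(x0, x1)]
    return t, x_t, u_t


def cfm_loss(v_pred, u_t):
    """Mean squared error between a predicted vector field and the target u_t."""
    return sum((p - u) ** 2 for p, u in zip(v_pred, u_t)) / len(u_t)


def latent_mixup(z_a, z_b, alpha=1.0, rng=random):
    """Mix two latent speaker vectors with a Beta(alpha, alpha) weight.

    This is generic mixup; how the paper selects and mixes latents
    is not specified in the abstract.
    """
    lam = rng.betavariate(alpha, alpha)
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(z_a, z_b)]
    return mixed, lam
```

In a full system, a decoder conditioned on the factorized features and the voice prompt would predict `v_pred` from `x_t` and `t`, and `cfm_loss` would be minimized over many sampled pairs.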

Paper Structure

This paper contains 18 sections, 4 equations, 2 figures, 5 tables, and 1 algorithm.

Figures (2)

  • Figure 1: Overall architecture of VoicePrompter. (a) Training phase; (b) Inference phase; (c) Speech Factorizing Encoder.
  • Figure 2: Comparison of (a) the original DiT block with adaLN-Zero and (b) the proposed DiT block with adaLN-Sep.