MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances

Junhyeok Lee, Helin Wang, Yaohan Guan, Thomas Thebaud, Laureano Moro-Velazquez, Jesús Villalba, Najim Dehak

TL;DR

MaskVCT addresses zero-shot voice conversion with multi-factor controllability by integrating linguistic, pitch, and speaker conditioning into a single masked codec language model operating on residual-vector-quantized tokens. It introduces triple classifier-free guidance with coefficients $\omega_{\text{all}}$, $\omega_{\text{spk}}$, and $\omega_{\text{ling}}$, and supports both continuous and discrete linguistic representations via SylBoost-based tokens, enabling a tunable balance between intelligibility and speaker fidelity. Two inference modes, MaskVCT-All (pitch-aware, higher intelligibility) and MaskVCT-Spk (speaker-focused), achieve strong target speaker and accent similarity with competitive WER/CER and MOS scores on LibriTTS-R and L2-ARCTIC evaluations. The approach gives users inference-time control over content, pitch, and identity. The authors note that certain syllabic representations reduce intelligibility, and they outline future work on optimizing the quantized linguistic representation with masked training.

Abstract

We introduce MaskVCT, a zero-shot voice conversion (VC) model that offers multi-factor controllability through multiple classifier-free guidances (CFGs). While previous VC models rely on a fixed conditioning scheme, MaskVCT integrates diverse conditions in a single model. To further enhance robustness and control, the model can leverage continuous or quantized linguistic features to enhance intelligibility and speaker similarity, and can use or omit pitch contour to control prosody. These choices allow users to seamlessly balance speaker identity, linguistic content, and prosodic factors in a zero-shot VC setting. Extensive experiments demonstrate that MaskVCT achieves the best target speaker and accent similarities while obtaining competitive word and character error rates compared to existing baselines. Audio samples are available at https://maskvct.github.io/.
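
To make the idea of multiple classifier-free guidances concrete, the sketch below combines one unconditional prediction with three conditional ones, each scaled by its own coefficient. This section does not state the combination rule used by MaskVCT, so the additive form here is an assumption modeled on standard classifier-free guidance, and the function name `combine_guidance` and its arguments are hypothetical.

```python
from typing import List, Sequence


def combine_guidance(
    uncond: Sequence[float],
    cond_all: Sequence[float],
    cond_spk: Sequence[float],
    cond_ling: Sequence[float],
    w_all: float,
    w_spk: float,
    w_ling: float,
) -> List[float]:
    """Blend per-token logits from four forward passes.

    ASSUMPTION: each guidance term is the difference between a conditional
    prediction and the unconditional one, added with its own weight. This is
    an illustrative rule, not the equation from the paper.
    """
    lengths = {len(uncond), len(cond_all), len(cond_spk), len(cond_ling)}
    if len(lengths) != 1:
        raise ValueError("all logit vectors must have the same length")
    return [
        u + w_all * (a - u) + w_spk * (s - u) + w_ling * (l - u)
        for u, a, s, l in zip(uncond, cond_all, cond_spk, cond_ling)
    ]


# Toy logits over a 3-entry codebook (made-up numbers for illustration).
uncond = [0.5, 0.5, 0.5]
cond_all = [2.0, 0.5, 0.0]
cond_spk = [1.0, 1.5, 0.5]
cond_ling = [0.5, 0.0, 2.5]

# Zero weights fall back to the unconditional prediction.
baseline = combine_guidance(uncond, cond_all, cond_spk, cond_ling, 0.0, 0.0, 0.0)

# A weight of one on a single term recovers that conditional prediction.
all_only = combine_guidance(uncond, cond_all, cond_spk, cond_ling, 1.0, 0.0, 0.0)
```

Under this assumed rule, raising `w_spk` pushes the prediction toward the speaker-conditioned output and raising `w_ling` pushes it toward the linguistically conditioned output, which is one way the balance between speaker fidelity and intelligibility could be exposed as a pair of knobs.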

Paper Structure

This paper contains 17 sections, 3 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Overall system description of MaskVCT. We perform column-wise addition of the embeddings and feed the result into MaskVCT. We employ 9 codebooks for DAC, but display only 2 here for brevity. All models operate at a 50 Hz frame rate.
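
The column-wise addition described in the Figure 1 caption can be sketched as follows: every token stream is embedded to the same dimension, and the embeddings that share a frame (one column) are summed into a single input vector. The toy tables, the two-codebook example, and the helper names `embed`, `add_columnwise`, and `num_frames` are assumptions for illustration; only the 9 codebooks and the 50 Hz frame rate come from the caption.

```python
from typing import List

Matrix = List[List[float]]  # shape: [frames][embedding_dim]

FRAME_RATE_HZ = 50   # from the caption: all models run at 50 Hz
NUM_CODEBOOKS = 9    # from the caption: DAC uses 9 codebooks


def embed(tokens: List[int], table: Matrix) -> Matrix:
    """Look up one embedding vector per frame (hypothetical helper)."""
    return [list(table[t]) for t in tokens]


def add_columnwise(streams: List[Matrix]) -> Matrix:
    """Sum the embeddings of all streams frame by frame."""
    n_frames = len(streams[0])
    if any(len(s) != n_frames for s in streams):
        raise ValueError("all streams must have the same number of frames")
    return [
        [sum(values) for values in zip(*(s[t] for s in streams))]
        for t in range(n_frames)
    ]


def num_frames(seconds: float) -> int:
    """Number of frames for a clip of the given duration at 50 Hz."""
    return int(round(seconds * FRAME_RATE_HZ))


# Toy example with 2 codebooks, 3 frames, and 2-dimensional embeddings.
table_a = [[0.0, 1.0], [2.0, 3.0]]
table_b = [[10.0, 20.0], [30.0, 40.0]]
codebook_tokens = [[0, 1, 1], [1, 0, 1]]

streams = [
    embed(codebook_tokens[0], table_a),
    embed(codebook_tokens[1], table_b),
]
fused = add_columnwise(streams)  # one summed vector per frame

# A 2-second clip has 100 frames, so 9 codebooks yield 900 codec tokens.
tokens_for_two_seconds = num_frames(2.0) * NUM_CODEBOOKS
```

Because all streams share the 50 Hz frame rate, they align frame for frame and need no resampling before the sum. In the full system the condition embeddings would presumably be added as further streams in the same way.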