ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training

Xinfa Zhu, Lei He, Yujia Xiao, Xi Wang, Xu Tan, Sheng Zhao, Lei Xie

TL;DR

ZSVC addresses zero-shot style voice conversion by disentangling speaking style from speaker timbre and leveraging in-context learning via speech prompting. It combines a speech codec with a SoundStorm-based latent diffusion model, an information bottleneck to isolate style, and UMAdaIN to perturb timbre in prompts, reinforced by adversarial training to enhance in-context style transfer. The approach is validated on a 44k-hour English dataset, showing strong style similarity while maintaining naturalness and speaker fidelity, and demonstrating robust zero-shot capability without explicit style annotations. This work advances practical, flexible voice conversion for applications like dubbing and broadcasting by enabling diverse speaking styles without parallel data or target-style labels.

Abstract

Style voice conversion aims to transform the speaking style of source speech into a desired style while keeping the original speaker's identity. However, previous style voice conversion approaches primarily focus on well-defined domains such as emotion, which limits their practical applications. In this study, we present ZSVC, a novel Zero-shot Style Voice Conversion approach that uses a speech codec and a latent diffusion model with a speech prompting mechanism to facilitate in-context learning for speaking style conversion. To disentangle speaking style from speaker timbre, we introduce an information bottleneck to filter out the speaking style in the source speech and employ Uncertainty Modeling Adaptive Instance Normalization (UMAdaIN) to perturb the speaker timbre in the style prompt. Moreover, we propose a novel adversarial training strategy to enhance in-context learning and improve style similarity. Experiments conducted on 44,000 hours of speech data demonstrate the superior performance of ZSVC in generating speech with diverse speaking styles in zero-shot scenarios.
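
The abstract does not spell out how UMAdaIN perturbs timbre, so the following is only a minimal sketch of one common way to apply uncertainty modeling to instance-normalization statistics: each utterance's per-channel mean and standard deviation are resampled with noise scaled by how much those statistics vary across the batch. The function name `perturb_instance_stats`, the `(batch, channels, frames)` layout, and the Gaussian resampling are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def perturb_instance_stats(x, rng, eps=1e-5):
    """Resample per-utterance channel statistics (hypothetical helper).

    x:   array of shape (batch, channels, frames), e.g. prompt features.
    rng: a numpy.random.Generator that supplies the perturbation noise.
    """
    # Instance statistics: one mean and std per utterance and channel.
    mu = x.mean(axis=2, keepdims=True)                    # (B, C, 1)
    sigma = np.sqrt(x.var(axis=2, keepdims=True) + eps)   # (B, C, 1)

    # Spread of those statistics across the batch (assumed noise scale).
    mu_spread = np.sqrt(mu.var(axis=0, keepdims=True) + eps)        # (1, C, 1)
    sigma_spread = np.sqrt(sigma.var(axis=0, keepdims=True) + eps)  # (1, C, 1)

    # Sample perturbed statistics around the original ones.
    new_mu = mu + rng.standard_normal(mu.shape) * mu_spread
    new_sigma = sigma + rng.standard_normal(sigma.shape) * sigma_spread

    # Normalize with the original statistics, then re-apply the perturbed ones.
    return new_sigma * (x - mu) / sigma + new_mu


if __name__ == "__main__":
    features = np.random.default_rng(0).standard_normal((4, 8, 50))
    perturbed = perturb_instance_stats(features, np.random.default_rng(1))
    print(perturbed.shape)
```

In this sketch the frame-level pattern within each channel is kept up to a shift and scale, while the utterance-level statistics, which tend to carry speaker-dependent information, are randomized.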

Paper Structure (13 sections, 6 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Overall framework of ZSVC. 'IC' denotes in-context learning through the speech prompting mechanism.
  • Figure 2: Detailed architecture of the proposed ZSVC. Dashed lines mark components that are used only during training.
  • Figure 3: t-SNE visualization of speaker representations (left) and emotion representations (right). Circles represent the original speech, and stars represent the converted speech.