CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

Yuke Li, Xinfa Zhu, Hanzhao Li, JiXun Yao, WenJie Tian, XiPeng Yang, YunLin Chen, Zhifei Li, Lei Xie

TL;DR

CoDiff-VC addresses zero-shot voice conversion by replacing cascaded ASR-based bottleneck extraction with a codec-assisted diffusion framework. It uses a single-codebook codec to extract content, a reference encoder with Mix-Style Layer Normalization to reduce residual timbre, and a multi-scale timbre modeling strategy within a diffusion backbone, guided by dual classifier-free guidance over content and timbre. The approach yields higher speaker similarity and naturalness than strong baselines, and ablation studies confirm the importance of each component. Diffusion-based generation improves quality, but its iterative sampling keeps inference slow, which leaves speedups as an open direction for practical deployment.

Abstract

Zero-shot voice conversion (VC) aims to convert the original speaker's timbre to any target speaker while keeping the linguistic content. Current mainstream zero-shot voice conversion approaches depend on pre-trained recognition models to disentangle linguistic content and speaker representation. This results in a timbre residue within the decoupled linguistic content and inadequacies in speaker representation modeling. In this study, we propose CoDiff-VC, an end-to-end framework for zero-shot voice conversion that integrates a speech codec and a diffusion model to produce high-fidelity waveforms. Our approach involves employing a single-codebook codec to separate linguistic content from the source speech. To enhance content disentanglement, we introduce Mix-Style layer normalization (MSLN) to perturb the original timbre. Additionally, we incorporate a multi-scale speaker timbre modeling approach to ensure timbre consistency and improve voice detail similarity. To improve speech quality and speaker similarity, we introduce dual classifier-free guidance, providing both content and timbre guidance during the generation process. Objective and subjective experiments affirm that CoDiff-VC significantly improves speaker similarity, generating natural and higher-quality speech.
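
The abstract does not spell out how the content and timbre guidance terms are combined. As a rough illustration only, the sketch below shows one common way to apply classifier-free guidance with two conditions: the denoiser is evaluated with no condition, with content only, and with content plus timbre, and the differences are scaled separately. The function names, the denoiser signature, and the weights `w_content` and `w_timbre` are assumptions made for this example and are not taken from the paper.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical denoiser signature (an assumption for this sketch):
# (noisy sample, content condition or None, timbre condition or None) -> predicted noise
Denoiser = Callable[
    [Sequence[float], Optional[Sequence[float]], Optional[Sequence[float]]], Vector
]


def dual_cfg_noise(
    denoiser: Denoiser,
    x_t: Sequence[float],
    content: Sequence[float],
    timbre: Sequence[float],
    w_content: float = 1.5,
    w_timbre: float = 1.5,
) -> Vector:
    """Combine three denoiser passes into one guided noise estimate.

    This is a generic two-condition guidance rule, not necessarily the
    exact formula used by CoDiff-VC.
    """
    eps_uncond = denoiser(x_t, None, None)      # no conditioning
    eps_content = denoiser(x_t, content, None)  # content only
    eps_full = denoiser(x_t, content, timbre)   # content and timbre
    return [
        u + w_content * (c - u) + w_timbre * (f - c)
        for u, c, f in zip(eps_uncond, eps_content, eps_full)
    ]


def toy_denoiser(
    x_t: Sequence[float],
    content: Optional[Sequence[float]],
    timbre: Optional[Sequence[float]],
) -> Vector:
    """Stand-in for a trained network: adds whichever conditions are present."""
    out = list(x_t)
    if content is not None:
        out = [o + c for o, c in zip(out, content)]
    if timbre is not None:
        out = [o + t for o, t in zip(out, timbre)]
    return out


if __name__ == "__main__":
    guided = dual_cfg_noise(
        toy_denoiser, [0.0, 1.0], [1.0, 1.0], [2.0, -2.0],
        w_content=2.0, w_timbre=0.5,
    )
    print(guided)
```

With both weights set to 1 the rule reduces to the fully conditioned prediction. Raising one weight pushes the sample harder toward that condition, which is the usual reason for keeping the two scales separate.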

Paper Structure

This paper contains 13 sections, 4 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Overall architecture of the CoDiff-VC. The content module within the green dashed line is used to extract linguistic content from the source speech, while we employ a multi-scale timbre modeling module to capture the details of speaker timbre. The diffusion module within the purple dashed line reconstructs the speech waveform conditioned on the linguistic content and speaker timbre.
  • Figure 2: The structure of the U-Net block.
  • Figure 3: t-SNE visualization of the coarse-grained timbre in multi-speaker settings.