EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning

Ziqi Liang; Jianzong Wang; Xulong Zhang; Yong Zhang; Ning Cheng; Jing Xiao

TL;DR

EAD-VC targets unsupervised disentanglement of speech into content, pitch, rhythm, and timbre for voice conversion (VC) without hand-crafted bottlenecks. The framework has two stages. Stage 1 trains self-supervised encoders to produce $Z_c$, $Z_p$, and $Z_r$ from the original and augmented speech. Stage 2 freezes these encoders and trains a bottleneck adaptor (BNA) with IFUB, a mutual information (MI) upper bound estimator, to minimize the information shared across components. A Joint Text-Guided Consistent (TGC) module complements this by guiding content extraction with text, ASR bottleneck features, an adversarial speaker classifier, and a timbre fusion mechanism. The key contributions are the IFUB-based MI minimization that decouples the components and the TGC module that curbs timbre leakage and content inconsistency during VC. Experiments on VCTK show improved disentanglement, naturalness, and speaker similarity over the baselines, along with better generalization to unseen speakers and in one-shot VC.

Abstract

Using unsupervised learning to disentangle speech into content, rhythm, pitch, and timbre for voice conversion has become an active research topic. Existing works generally disentangle speech components through human-crafted bottleneck features, which cannot separate the information sufficiently: pitch and rhythm may still be mixed together. The resulting information overlap between components reduces speech naturalness. To overcome these limits, we propose a two-stage model that disentangles speech representations in a self-supervised manner without a human-crafted bottleneck design. It minimizes Mutual Information (MI), measured with a purpose-designed upper bound estimator (IFUB), to separate overlapping information between speech components. Moreover, we design a Joint Text-Guided Consistent (TGC) module to guide the extraction of speech content and eliminate timbre leakage. Experiments show that our model outperforms the baseline in disentanglement effectiveness, speech naturalness, and similarity. Audio samples can be found at https://largeaudiomodel.com/eadvc.
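
To make the idea of minimizing an MI upper bound between components more concrete, here is a minimal sketch of a sampled CLUB-style estimator. This is not IFUB, whose definition is not given in this summary. It is a generic, well-known variational bound used as a stand-in, and it is only a true upper bound when the variational model matches the real conditional distribution. The function name `club_upper_bound`, the unit-variance Gaussian assumption, and the toy data are all illustrative assumptions.

```python
import numpy as np

def club_upper_bound(mu_x, y):
    """Sampled CLUB-style MI estimate with a unit-variance Gaussian q(y|x).

    mu_x: (N, D) mean of y predicted from each x_i (hypothetical predictor output).
    y:    (N, D) samples paired row-wise with mu_x.
    This is a stand-in for illustration, not the paper's IFUB estimator.
    """
    # log q(y_i | x_i) for matched pairs, up to a constant shared by both terms
    positive = -0.5 * np.sum((y - mu_x) ** 2, axis=1)          # shape (N,)
    # log q(y_j | x_i) for every pair (i, j), averaged over j
    diff = y[None, :, :] - mu_x[:, None, :]                    # shape (N, N, D)
    negative = -0.5 * np.sum(diff ** 2, axis=2).mean(axis=1)   # shape (N,)
    return float(np.mean(positive - negative))

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 4))

# y depends strongly on x, and the predictor uses x directly as the mean.
dependent = club_upper_bound(x, x + 0.1 * rng.normal(size=x.shape))

# y is independent of x, so the best predictor is a constant mean.
independent = club_upper_bound(np.zeros_like(x), rng.normal(size=x.shape))
```

In a disentanglement setting, an estimate like `dependent` would be computed between two component embeddings (for example pitch and rhythm) and added to the training loss, so that driving it toward zero discourages one embedding from predicting the other.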

Paper Structure (16 sections, 15 equations, 4 figures, 2 tables, 1 algorithm)

Introduction
Related work
Methodology
SSL-based speech disentanglement
Mutual information with IFUB estimator
Joint text-guided consistent learning
Experiments
Experiment setup
Evaluation
Subjective evaluation results
Objective evaluation results
Generalization to unseen speaker
Ablation study results
Conversion rate
Conclusions
...and 1 more sections

Figures (4)

Figure 1: Framework of EAD-VC, showing the two stages of our method. (I) The encoders are trained on the data and its augmented versions to disentangle speech, as in (a). (II) The frozen encoders extract $Z_{c}$, $Z_{p}$, and $Z_{r}$, as in (b). The desired content embedding $\hat{Z}_{c}$ is obtained from phonemes and guides the content encoder training. $E_{ASR}$ keeps the content consistent after VC.
Figure 2: VC scores. F denotes female and M denotes male. The x-axis shows the models and the y-axis shows the prediction scores.
Figure 3: Timbre embedding visualization on one-shot VC. (a) Liu et al. auto-disentangle; (b) EAD-VC.
Figure 4: Subjective conversion rate evaluation. Each group contains three parts, corresponding to the conversion rates of pitch, rhythm, and timbre.
