Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval

Yimin Deng; Huaizhen Tang; Xulong Zhang; Ning Cheng; Jing Xiao; Jianzong Wang

TL;DR

This paper tackles the problem of disentangling linguistic content from speaker style in voice conversion (VC). It introduces CTVC, a framework that uses contrastive learning to align frame-level content with phoneme-level information and a time-invariant retrieval mechanism to capture global timbre, avoiding heavy reliance on pre-trained speaker models. The model optimizes a combined objective with terms for reconstruction, content contrast, time-invariant style constraints, and adversarial speaker-identity removal, yielding improved speech quality and speaker similarity. Experiments on AISHELL-3 show that CTVC outperforms several baselines in both Many-to-Many and One-Shot VC tasks. Ablations confirm that each component contributes to disentanglement and timbre fidelity, which suggests practical potential for robust VC systems.

Abstract

Voice conversion refers to transferring speaker identity while preserving the content well. Better disentanglement of speech representations leads to better voice conversion. Recent studies have found that phonetic information from the input audio can represent content well. In addition, modeling speaker style with pre-trained models makes the process more complex. To tackle these issues, we introduce a new method named "CTVC", which learns disentangled speech representations with contrastive learning and time-invariant retrieval. Specifically, a similarity-based compression module is used to tighten the connection between frame-level hidden features and phoneme-level linguistic information. Additionally, a time-invariant retrieval is proposed for timbre extraction based on multiple segmentations and mutual information. Experimental results demonstrate that "CTVC" outperforms previous studies and improves the sound quality and speaker similarity of the converted results.
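To make the idea of similarity-based compression more concrete, the following is a minimal sketch, not the authors' module. It assumes that adjacent frames with high cosine similarity belong to the same phoneme-like segment and are averaged into one vector. The function names `cosine` and `compress_frames`, the greedy merging rule, and the `threshold` value are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def compress_frames(frames, threshold=0.9):
    """Greedily merge adjacent frames into segments (illustrative assumption).

    A frame joins the current segment when its cosine similarity to the
    previous frame is at least `threshold`; otherwise it starts a new segment.
    Returns (segment_means, boundaries), where boundaries holds the start
    index of each segment.
    """
    if not frames:
        return [], []
    segments = [[frames[0]]]
    boundaries = [0]
    for i in range(1, len(frames)):
        if cosine(frames[i - 1], frames[i]) >= threshold:
            segments[-1].append(frames[i])
        else:
            segments.append([frames[i]])
            boundaries.append(i)
    # Average each segment column-wise to get one vector per segment.
    means = [[sum(col) / len(seg) for col in zip(*seg)] for seg in segments]
    return means, boundaries


# Toy input: two frames pointing one way, then three pointing another way.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.0, 1.0]]
means, boundaries = compress_frames(frames)
```

On this toy input the five frames collapse into two segment vectors, which is the kind of frame-to-phoneme-level shortening the abstract describes. A learned module would presumably replace the fixed threshold, but that detail is not stated in this summary.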

Paper Structure (12 sections, 9 equations, 3 figures, 2 tables)

This paper contains 12 sections, 9 equations, 3 figures, 2 tables.

Introduction
Methodology
Speaker-Independent Content Feature Discovery
Time-Invariant Retrieval for Speaker Representation
Training Strategy
Experiments
Datasets and Configurations
Comparison of VC Tasks
Evaluation of Speaker Similarity
Ablation Experiments
Conclusion
Acknowledgement

Figures (3)

Figure 1: The framework of "CTVC". $C_x$ is the content embedding generated by the content encoder, while $S_x$ refers to the global speaker embedding. GRL denotes Gradient Reversal Layer. MI means Mutual Information. In the compression module, the colors indicate frames belonging to different phonemes and the dashed lines indicate boundaries.
Figure 2: Different segmentation methods for style control.
Figure 3: Objective evaluation results for Voice Conversion. F: Female; M: Male. Green groups are real speech. Red groups are synthesized speech from different models.
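The Gradient Reversal Layer (GRL) named in the Figure 1 caption can be illustrated with a small sketch. In its standard form, a GRL passes its input through unchanged in the forward pass and multiplies the incoming gradient by a negative factor in the backward pass, so the upstream encoder is pushed to remove whatever the downstream classifier is trying to predict. The class below is a framework-free illustration with hand-written `forward` and `backward` methods; the class name and the `scale` parameter are hypothetical, and how CTVC wires the layer into its speaker classifier is not described in this summary.

```python
class GradientReversal:
    """Identity in the forward pass, negated and scaled gradient in the backward pass."""

    def __init__(self, scale=1.0):
        # scale controls how strongly the reversed gradient is applied.
        self.scale = scale

    def forward(self, x):
        # Forward pass: return the input values unchanged.
        return list(x)

    def backward(self, grad_output):
        # Backward pass: flip the sign of the gradient and scale it.
        return [-self.scale * g for g in grad_output]


grl = GradientReversal(scale=0.5)
activations = grl.forward([1.0, -2.0])      # same values as the input
reversed_grad = grl.backward([0.2, -0.4])   # sign-flipped, halved gradient
```

In an automatic-differentiation framework the same behavior would normally be written as a custom differentiable operation rather than called by hand as here.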
