Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation

Yimin Deng; Jianzong Wang; Xulong Zhang; Ning Cheng; Jing Xiao

TL;DR

This work tackles timbre leakage and underused prosody in self-supervised speech representations for expressive voice conversion. It introduces SAVC, a framework that takes soft speech units from HuBert-soft as input and combines an attribute encoder, adversarial style augmentation, and teacher-student prosody distillation to disentangle content and prosody while controlling timbre. Experiments on VCTK and the Emotional Speaker Dataset show improved intelligibility, naturalness, and timbre and prosody similarity, including in zero-shot conditions. The approach offers a practical SSL-based route to expressive VC with reduced speaker leakage and better prosody modeling.

Abstract

Voice conversion is the task of transforming the voice characteristics of source speech while preserving its content. Self-supervised representation learning models are increasingly used for content extraction. However, these representations retain a substantial amount of speaker information, which leads to timbre leakage, while the prosodic information in the hidden units goes largely unused. To address these issues, we propose "SAVC", a novel framework for expressive voice conversion based on soft speech units from HuBert-soft. Taking soft speech units as input, we design an attribute encoder that extracts content and prosody features separately. Specifically, we first introduce a statistic perturbation, imposed by adversarial style augmentation, to eliminate speaker information. The prosody is then modeled implicitly on the soft speech units through knowledge distillation. Experimental results show that the converted speech is more intelligible and more natural than that of previous work.
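The statistic perturbation mentioned in the abstract can be pictured with a minimal sketch. It assumes the features are a sequence of frames with one value per channel, and it swaps in random Gaussian noise for the adversarially chosen perturbation, so it illustrates only the normalize-then-restyle step and not the SAVC module itself. The names `channel_stats`, `perturb_statistics`, and `noise_scale` are hypothetical.

```python
import math
import random


def channel_stats(frames, eps=1e-5):
    """Per-channel mean and standard deviation over the time axis."""
    n_frames = len(frames)
    n_channels = len(frames[0])
    means, stds = [], []
    for c in range(n_channels):
        column = [frame[c] for frame in frames]
        mu = sum(column) / n_frames
        var = sum((x - mu) ** 2 for x in column) / n_frames
        means.append(mu)
        stds.append(math.sqrt(var + eps))
    return means, stds


def perturb_statistics(frames, noise_scale=0.1, rng=None):
    """Normalize each channel, then re-style it with perturbed statistics.

    Assumption: Gaussian noise stands in for the adversarially chosen
    perturbation; noise_scale is a hypothetical hyperparameter.
    """
    rng = rng or random.Random(0)
    means, stds = channel_stats(frames)
    # Shift the mean and rescale the standard deviation of every channel.
    new_means = [mu + noise_scale * rng.gauss(0.0, 1.0) for mu in means]
    new_stds = [abs(sd * (1.0 + noise_scale * rng.gauss(0.0, 1.0))) for sd in stds]
    return [
        [
            (x - means[c]) / stds[c] * new_stds[c] + new_means[c]
            for c, x in enumerate(frame)
        ]
        for frame in frames
    ]


if __name__ == "__main__":
    toy = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]]  # 4 frames, 2 channels
    for frame in perturb_statistics(toy, noise_scale=0.5):
        print([round(v, 3) for v in frame])
```

The normalized trajectory of each channel is left unchanged while its mean and scale move, which is the intuition behind treating those statistics as style to be perturbed. In an adversarial variant the perturbation would be chosen to be hard for the model rather than sampled at random.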

Paper Structure (23 sections, 5 equations, 4 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Voice Conversion
Timbre Leakage
Prosody Modeling
Methodology
Model Structure
Speaker Irrelevant Attributes Extraction
Feature statistic modeling
Adversarial Style Augmentation module
Prosody Modeling
Prosody Distillation
Expressiveness Constraints
Training Strategy
Architecture of Attribute Encoder
...and 8 more sections

Figures (4)

Figure 1: The framework of SAVC. $Z_{F_c}$, $Z_{F_P}$ are the content embedding and prosody embedding extracted by the attribute encoder respectively. $Z_P$ is the prosody embedding extracted by the teacher model. Speaker embedding $S_X$ is extracted by a pre-trained model.
Figure 2: Feature statistic based Adversarial Style Augmentation (ASA) module.
Figure 3: Architecture of the attribute encoder.
Figure 4: Listening test results of ablation studies. (M1: w/o ASA; M2: w/o su; M3: w/o pm)
