AVENet: Disentangling Features by Approximating Average Features for Voice Conversion

Wenyu Wang; Yiquan Zhou; Jihua Zhu; Hongwu Ding; Jiacheng Xu; Shihao Li

TL;DR

This work addresses timbre–content disentanglement in voice conversion (VC) by introducing the average feature, a frame-level content representation derived from frame-aligned parallel speech (FAPS). It proposes AVENet, which learns to map real speech features to average-feature–like representations, preserving content while suppressing speaker timbre even when FAPS data are unavailable. The FAPS data are generated synthetically with a modified VITS pipeline, and AVENet is trained with an average reconstruction loss and a positive contrastive loss to encourage content alignment and reduce timbre leakage. When integrated into a VITS-based VC system, AVENet-derived features improve speaker similarity and naturalness across multiple SSL feature types, indicating effective disentanglement with practical benefits for VC.
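The summary above names two training objectives but does not give their formulas. The sketch below is one plausible reading, not the authors' definition: it assumes the average reconstruction loss is a mean squared error between AVENet's output and the average feature, and that the positive contrastive loss pulls together the outputs for two frame-aligned utterances with the same content but different speakers, measured by cosine distance. The function names and both loss forms are hypothetical.

```python
import numpy as np


def average_reconstruction_loss(pred: np.ndarray, avg: np.ndarray) -> float:
    """Assumed form: mean squared error between predicted and average features.

    Both arrays have shape (num_frames, feat_dim).
    """
    return float(np.mean((pred - avg) ** 2))


def positive_contrastive_loss(pred_a: np.ndarray, pred_b: np.ndarray,
                              eps: float = 1e-8) -> float:
    """Assumed form: mean frame-wise cosine distance between a positive pair.

    pred_a and pred_b are outputs for the same content spoken by two
    different speakers, each of shape (num_frames, feat_dim).
    """
    dot = np.sum(pred_a * pred_b, axis=-1)
    norms = np.linalg.norm(pred_a, axis=-1) * np.linalg.norm(pred_b, axis=-1)
    cosine = dot / (norms + eps)
    return float(np.mean(1.0 - cosine))
```

Under these assumptions, both terms reach zero when the network emits the same speaker-independent feature for every speaker's rendition of an utterance, which is the behavior the summary describes.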

Abstract

Voice conversion (VC) has made progress in feature disentanglement, but it is still difficult to balance timbre and content information. This paper evaluates the pre-trained model features commonly used in voice conversion and proposes a new method for disentangling speech feature representations. Specifically, we first propose an ideal content feature, referred to as the average feature, which is calculated by averaging the features within frame-level aligned parallel speech (FAPS) data. To generate FAPS data, we freeze the duration predictor of a text-to-speech system and manipulate the speaker embedding. To fit the average feature on traditional VC datasets, we then design AVENet, which takes speech features as input and generates features that closely match the average feature. We evaluate AVENet-extracted features within a VC system, and the experimental results demonstrate their superiority over multiple current speech feature disentangling methods. These findings affirm the effectiveness of our disentanglement approach.
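The average feature described above can be illustrated with a minimal sketch. It assumes the FAPS features for one utterance are already stacked into an array of shape (num_speakers, num_frames, feat_dim), so that frame t carries the same content for every speaker; the function name and array layout are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def average_feature(faps_features: np.ndarray) -> np.ndarray:
    """Average frame-aligned features over the speaker axis.

    faps_features: array of shape (num_speakers, num_frames, feat_dim),
    where every speaker's rendition is frame-aligned with the others.
    Returns an array of shape (num_frames, feat_dim).
    """
    if faps_features.ndim != 3:
        raise ValueError("expected shape (num_speakers, num_frames, feat_dim)")
    return faps_features.mean(axis=0)


# Toy illustration: shared content plus a per-speaker offset standing in for timbre.
rng = np.random.default_rng(0)
content = rng.normal(size=(50, 8))       # (num_frames, feat_dim)
timbre = rng.normal(size=(200, 1, 8))    # one constant offset per speaker
features = content[None, :, :] + timbre  # (num_speakers, num_frames, feat_dim)
avg = average_feature(features)
```

In this toy setup the per-speaker offsets largely cancel as more speakers are averaged, so `avg` sits closer to `content` than any single speaker's features do. Real SSL features need not decompose additively, so this only illustrates the intuition.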
