Residual Speaker Representation for One-Shot Voice Conversion

Le Xu; Jiangyan Yi; Tao Wang; Yong Ren; Rongxiu Zhong; Zhengqi Wen; Jianhua Tao

TL;DR

This work addresses two weaknesses of one-shot voice conversion: limited robustness to unseen speakers and limited control over timbre. It introduces the Residual Speaker Module (RSM), which uses multi-layer residual approximations and token-based CrossAttention to encode speaker timbre as a controllable, fixed-content representation that mitigates out-of-distribution (OOD) issues. Results on VCTK and LibriTTS show improved objective metrics (lower WER/CER) and stronger speaker similarity, and ablations confirm the benefit of multi-layer residual tokens for controllability. The approach supports robust, customizable voice conversion and lays groundwork for finer-grained timbre manipulation in speech synthesis.

Abstract

Recent advances in voice conversion have yielded high-quality results. However, two critical challenges remain. First, current voice conversion methods have limited robustness to unseen speakers. Second, they offer limited control over the timbre representation. To address these challenges, this paper presents the residual speaker module, a novel approach that leverages tokens of multi-layer residual approximations to enhance robustness when dealing with unseen speakers. Introducing multi-layer approximations facilitates the separation of information from the timbre, enabling effective control over timbre in voice conversion. The proposed method outperforms baselines in subjective and objective evaluations, demonstrating superior performance and increased robustness. Our demo page is publicly available.
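
To make the idea of multi-layer residual approximation over tokens more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`cross_attention`, `residual_speaker_representation`), the single-head dot-product attention with keys equal to values, the random token banks, and all dimensions are assumptions chosen for illustration. The actual module presumably uses learned projections and trained tokens, which this summary does not specify.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(query, tokens):
    """Single-head dot-product attention of one query over a token bank.

    Assumption: keys and values are the tokens themselves (no learned
    projections), so the output is a convex combination of the tokens.
    """
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d)
              for tok in tokens]
    weights = softmax(scores)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens))
            for i in range(d)]


def residual_speaker_representation(speaker_embedding, token_banks):
    """Approximate an embedding layer by layer (hypothetical helper).

    Each layer attends over its own token bank to approximate whatever
    the previous layers left unexplained (the residual).
    """
    residual = list(speaker_embedding)
    layer_outputs = []
    for tokens in token_banks:
        approx = cross_attention(residual, tokens)
        layer_outputs.append(approx)
        # Pass on only the part this layer failed to capture.
        residual = [r - a for r, a in zip(residual, approx)]
    # The final representation is the sum of the per-layer approximations.
    total = [sum(vals) for vals in zip(*layer_outputs)]
    return total, layer_outputs, residual


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n_tokens, n_layers = 4, 8, 3  # illustrative sizes only
    banks = [[[rng.uniform(-1, 1) for _ in range(dim)]
              for _ in range(n_tokens)] for _ in range(n_layers)]
    embedding = [rng.uniform(-1, 1) for _ in range(dim)]
    total, layers, leftover = residual_speaker_representation(embedding, banks)
    print("layers:", len(layers), "leftover norm:",
          math.sqrt(sum(x * x for x in leftover)))
```

Because every layer output is built only from its token bank, the summed representation stays within what the tokens can express, which is one plausible reading of why such a design could help with unseen speakers. Keeping the per-layer outputs separate also suggests how control might work: replacing one entry of `layer_outputs` with the corresponding entry from another speaker would change only that layer's contribution, in the spirit of the layer-by-layer replacement shown in Figure 3.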

Paper Structure (17 sections, 2 equations, 4 figures, 3 tables, 1 algorithm)

Introduction
Method
Residual Speaker Module
Speaker Encoder
Residual Representation Layer
Voice Conversion
Experiments
Datasets
Implementation Details
Baseline
Evaluation Metrics
Results and Discussion
Objective Evaluation
Subjective Evaluation
Voice Control
...and 2 more sections

Figures (4)

Figure 1: Framework of the voice conversion and speaker representation control
Figure 2: Pipeline of the voice control
Figure 3: Mel-spectrogram of synthesized speech after replacing speaker representations extracted by RSM layer by layer
Figure 4: Visualization of the codebook of RSM (first line) and ablation study (second line)
