Residual Speaker Representation for One-Shot Voice Conversion

Le Xu, Jiangyan Yi, Tao Wang, Yong Ren, Rongxiu Zhong, Zhengqi Wen, Jianhua Tao

TL;DR

This work tackles two weaknesses of one-shot voice conversion: poor robustness to unseen speakers and limited control over timbre. It introduces the Residual Speaker Module (RSM), which uses multi-layer residual approximations and token-based cross-attention to encode speaker timbre as a controllable, fixed-content representation that mitigates out-of-distribution (OOD) issues. Results on VCTK and LibriTTS show improved objective metrics (lower WER/CER) and stronger speaker similarity, and ablations confirm that multi-layer residual tokens benefit controllability. The approach makes voice conversion more robust and customizable, and lays groundwork for finer-grained timbre manipulation in speech synthesis.

Abstract

Recently, there have been significant advancements in voice conversion, resulting in high-quality performance. However, there are still two critical challenges in this field. Firstly, current voice conversion methods have limited robustness when encountering unseen speakers. Secondly, they also have limited ability to control timbre representation. To address these challenges, this paper presents a novel approach that leverages tokens of multi-layer residual approximations to enhance robustness when dealing with unseen speakers, called the residual speaker module. Introducing multi-layer approximations facilitates the separation of information from the timbre, enabling effective control over timbre in voice conversion. The proposed method outperforms baselines in subjective and objective evaluations, demonstrating superior performance and increased robustness. Our demo page is publicly available.
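
The idea of approximating a speaker representation with tokens over several residual layers can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head attention, the random token banks, the dimensions, and the names `cross_attention` and `residual_speaker_embedding` are all assumptions made for illustration. Each layer attends from the current residual to a bank of tokens, keeps the attended result as that layer's approximation, and passes what is left to the next layer.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()


def cross_attention(query, tokens):
    # Hypothetical single-head attention: `query` has shape (d,),
    # `tokens` has shape (n_tokens, d). The output is a convex
    # combination of the token vectors.
    d = query.shape[-1]
    weights = softmax(tokens @ query / np.sqrt(d))
    return weights @ tokens, weights


def residual_speaker_embedding(speaker_vec, token_banks):
    # Approximate `speaker_vec` layer by layer. Each layer models
    # whatever the previous layers left unexplained (the residual).
    residual = speaker_vec.copy()
    layer_outputs = []
    for tokens in token_banks:
        approx, _ = cross_attention(residual, tokens)
        layer_outputs.append(approx)
        residual = residual - approx
    return layer_outputs, residual


# Assumed toy sizes: 3 residual layers, 4 tokens per layer, 8 dimensions.
rng = np.random.default_rng(0)
dim, n_layers, n_tokens = 8, 3, 4
token_banks = [rng.standard_normal((n_tokens, dim)) for _ in range(n_layers)]
speaker_vec = rng.standard_normal(dim)

layer_outputs, residual = residual_speaker_embedding(speaker_vec, token_banks)
# The token-based speaker representation is the sum of the layer outputs.
embedding = np.sum(layer_outputs, axis=0)
```

In this sketch every layer output is built only from that layer's tokens, so an unseen speaker is still mapped onto combinations of known tokens, and a single layer's output can be swapped or edited without touching the others. Whether the paper's module combines layers in exactly this way is not stated in the text above.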

Paper Structure

This paper contains 17 sections, 2 equations, 4 figures, 3 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Framework of the voice conversion and speaker representation control
  • Figure 2: Pipeline of the voice control
  • Figure 3: Mel-spectrogram of synthesized speech after replacing speaker representations extracted by RSM layer by layer
  • Figure 4: Visualization of the codebook of RSM (first row) and ablation study (second row)