Residual Speaker Representation for One-Shot Voice Conversion
Le Xu, Jiangyan Yi, Tao Wang, Yong Ren, Rongxiu Zhong, Zhengqi Wen, Jianhua Tao
TL;DR
This work addresses two gaps in one-shot voice conversion: weak robustness to unseen speakers and limited control over timbre. It introduces the Residual Speaker Module (RSM), which uses multi-layer residual approximations and token-based cross-attention to encode speaker timbre as a controllable, fixed-content representation that mitigates out-of-distribution (OOD) issues. Experiments on VCTK and LibriTTS show better objective metrics (lower WER/CER) and stronger speaker similarity, and ablations confirm that multi-layer residual tokens improve controllability. The approach offers practical benefits for robust, customizable voice conversion and lays groundwork for finer-grained timbre manipulation in speech synthesis.
Abstract
Voice conversion has advanced significantly in recent years and now achieves high-quality results. However, two critical challenges remain. First, current voice conversion methods have limited robustness to unseen speakers. Second, they offer limited control over the timbre representation. To address these challenges, this paper presents the residual speaker module, a novel approach that uses tokens of multi-layer residual approximations to improve robustness to unseen speakers. The multi-layer approximations also help separate timbre from other information, enabling effective control over timbre in voice conversion. The proposed method outperforms baselines in both subjective and objective evaluations, demonstrating increased robustness. Our demo page is publicly available.
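
To make the idea of "tokens of multi-layer residual approximations" concrete, the following is a minimal NumPy sketch of one possible reading, not the authors' implementation. It assumes each layer holds its own bank of tokens, attends over that bank using the current residual as the query, and passes what is left unexplained to the next layer. The function name `residual_speaker_sketch`, the token banks, the single-head scaled dot-product attention, and all dimensions are hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def residual_speaker_sketch(speaker_emb, token_banks):
    """Approximate a speaker embedding with a stack of token banks.

    ASSUMPTION: this is an illustrative reading of "multi-layer residual
    approximations", not the method from the paper.

    speaker_emb: array of shape (d,)
    token_banks: list of arrays, each of shape (n_tokens, d)
    Returns the summed approximation (d,) and the attention weights per layer.
    """
    d = speaker_emb.shape[0]
    residual = speaker_emb.astype(float).copy()
    approx = np.zeros(d)
    per_layer_weights = []
    for tokens in token_banks:
        # Cross-attention with the residual as query and the tokens as keys/values.
        scores = tokens @ residual / np.sqrt(d)
        weights = softmax(scores)
        layer_out = weights @ tokens  # convex combination of this layer's tokens
        approx += layer_out
        residual = residual - layer_out  # the next layer models what remains
        per_layer_weights.append(weights)
    return approx, per_layer_weights


# Hypothetical sizes: 3 layers, 8 tokens per layer, 16-dimensional embeddings.
rng = np.random.default_rng(0)
banks = [rng.standard_normal((8, 16)) for _ in range(3)]
emb = rng.standard_normal(16)
approx, weights = residual_speaker_sketch(emb, banks)
```

Under these assumptions, the output is always assembled from the tokens of each bank, so an unseen speaker's embedding is mapped onto combinations of known tokens, and editing the per-layer attention weights is one conceivable handle for controlling timbre. In a trained model the token banks would be learned parameters rather than random arrays.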
