Takin-VC: Expressive Zero-Shot Voice Conversion via Adaptive Hybrid Content Encoding and Enhanced Timbre Modeling
Yuguang Yang, Yu Pan, Jixun Yao, Xiang Zhang, Jianhao Ye, Hongbin Zhou, Lei Xie, Lei Ma, Jianjun Zhao
TL;DR
Takin-VC advances expressive zero-shot voice conversion by integrating an adaptive hybrid content encoder that blends PPGs and quantized SSL features to capture linguistic content while enriching paralinguistic cues. It introduces memory-augmented and context-aware timbre modeling to generate high-quality target timbre inputs and effectively fuse them with source content for a conditional flow matching decoder, enabling stable, real-time synthesis. Empirical results on large-scale multilingual and LibriTTS-like datasets show consistent gains in naturalness, expressiveness, and speaker similarity over state-of-the-art VC baselines, with ablations confirming the contribution of each module. The work promises practical impact for high-fidelity, zero-shot VC with preserved expressiveness across unseen speakers and languages.
Abstract
Expressive zero-shot voice conversion (VC) is a critical and challenging task that aims to transform the source timbre into an arbitrary unseen speaker while preserving the original content and expressive qualities. Despite recent progress in zero-shot VC, there remains considerable potential for improvements in speaker similarity and speech naturalness. Moreover, existing zero-shot VC systems struggle to fully reproduce paralinguistic information in highly expressive speech, such as breathing, crying, and emotional nuances, limiting their practical applicability. To address these issues, we propose Takin-VC, a novel expressive zero-shot VC framework via adaptive hybrid content encoding and memory-augmented context-aware timbre modeling. Specifically, we introduce an innovative hybrid content encoder that incorporates an adaptive fusion module, capable of effectively integrating quantized features of the pre-trained WavLM and HybridFormer in an implicit manner, so as to extract precise linguistic features while enriching paralinguistic elements. For timbre modeling, we propose advanced memory-augmented and context-aware modules to generate high-quality target timbre features and fused representations that seamlessly align source content with target timbre. To enhance real-time performance, we advocate a conditional flow matching model to reconstruct the Mel-spectrogram of the source speech. Experimental results show that our Takin-VC consistently surpasses state-of-the-art VC systems, achieving notable improvements in terms of speech naturalness, speech expressiveness, and speaker similarity, while offering enhanced inference speed.
