vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

Yiwei Guo, Zhihan Li, Junjie Li, Chenpeng Du, Hankun Wang, Shuai Wang, Xie Chen, Kai Yu

TL;DR

vec2wav 2.0 reframes voice conversion as a prompted vocoding task using discrete SSL content tokens and WavLM timbre prompts, integrated via a Conformer frontend and an adaptive BigVGAN generator. It introduces an adaptive Snake activation that modulates waveform frequency and magnitude based on target timbre, enabling timbre control without explicit speaker disentanglement. The method delivers superior audio quality and speaker similarity for English any-to-any VC and shows competitive cross-lingual VC with monolingual training data. This work suggests that speaker timbre can be effectively manipulated through discrete token vocoders, offering a scalable path toward LLM-driven speech generation.

Abstract

We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech and treat VC as a prompted vocoding task. To compensate for the loss of speaker timbre in the content tokens, vec2wav 2.0 uses WavLM features to provide strong timbre-dependent information. A novel adaptive Snake activation function is proposed to better incorporate timbre into the waveform reconstruction process. In this way, vec2wav 2.0 learns to alter the speaker timbre appropriately given different reference prompts. Moreover, vec2wav 2.0 requires no supervised data to be trained effectively. Experimental results demonstrate that vec2wav 2.0 outperforms all other baselines by a considerable margin in terms of audio quality and speaker similarity in any-to-any VC. Ablation studies verify the effects of the proposed techniques. Furthermore, vec2wav 2.0 achieves competitive cross-lingual VC even when trained only on a monolingual corpus. Thus, vec2wav 2.0 shows that timbre can potentially be manipulated by speech token vocoders alone, pushing the frontiers of VC and speech synthesis.
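
The adaptive Snake idea can be illustrated with a minimal sketch. It assumes the BigVGAN-style Snake form x + (1/beta) * sin^2(alpha * x), where alpha sets the frequency and beta the magnitude of the periodic term. How the timbre embedding shifts alpha and beta here (a linear projection followed by an exponential) is a hypothetical parameterization chosen for illustration, not the paper's confirmed formulation. The names `adaptive_snake`, `w_alpha`, and `w_beta` are likewise assumptions.

```python
import math


def snake(x, alpha, beta):
    """BigVGAN-style Snake activation: x + (1/beta) * sin^2(alpha * x)."""
    return x + (1.0 / beta) * math.sin(alpha * x) ** 2


def adaptive_snake(xs, timbre, w_alpha, w_beta, base_alpha=1.0, base_beta=1.0):
    """Apply Snake with alpha and beta conditioned on a timbre vector.

    Assumption (not from the paper): the timbre vector is projected to two
    scalars with weights w_alpha and w_beta, and the exponential of each
    scalar rescales the base alpha and beta so both stay positive.
    """
    shift_alpha = sum(w * t for w, t in zip(w_alpha, timbre))
    shift_beta = sum(w * t for w, t in zip(w_beta, timbre))
    alpha = base_alpha * math.exp(shift_alpha)  # frequency of the periodic term
    beta = base_beta * math.exp(shift_beta)     # inverse magnitude of the periodic term
    return [snake(x, alpha, beta) for x in xs]


# Hypothetical two-dimensional timbre vectors for two reference speakers.
features = [0.1, 0.5, -0.3]
speaker_a = adaptive_snake(features, [1.0, 0.0], [0.5, 0.0], [0.0, 0.2])
speaker_b = adaptive_snake(features, [0.0, 1.0], [0.5, 0.0], [0.0, 0.2])
```

With a zero timbre vector the function reduces to the plain Snake activation, so in this sketch the conditioning acts purely as a speaker-dependent adjustment of the activation's frequency and magnitude.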

Paper Structure

This paper contains 13 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Architecture overview of vec2wav 2.0.
  • Figure 2: Detailed architecture of BigVGAN generator with proposed adaptive Snake activations.
  • Figure 3: Objective SECS and P.Corr comparisons with varied input tokens and models. A perfect VC system would lie in the top-right corner.
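
The two axes in Figure 3 can be sketched as follows, assuming SECS denotes speaker embedding cosine similarity and P.Corr denotes the Pearson correlation between the pitch contours of source and converted speech. Both readings are assumptions, since the listing above does not define them. Extracting speaker embeddings and F0 contours is out of scope here, so the inputs below are hypothetical placeholder values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_corr(p, q):
    """Pearson correlation between two equal-length sequences."""
    n = len(p)
    mean_p = sum(p) / n
    mean_q = sum(q) / n
    cov = sum((x - mean_p) * (y - mean_q) for x, y in zip(p, q))
    std_p = math.sqrt(sum((x - mean_p) ** 2 for x in p))
    std_q = math.sqrt(sum((y - mean_q) ** 2 for y in q))
    return cov / (std_p * std_q)


# Hypothetical speaker embeddings (target reference vs. converted speech).
secs = cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.85, 0.05])

# Hypothetical F0 contours in Hz (source vs. converted speech).
p_corr = pearson_corr([110.0, 120.0, 135.0, 128.0], [210.0, 228.0, 260.0, 245.0])
```

Under this reading, a high SECS indicates that the converted speech matches the target speaker, and a high P.Corr indicates that the source intonation is preserved, which is why the top-right corner is the ideal region.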