AdaptVC: High Quality Voice Conversion with Adaptive Learning

Jaehun Kim, Ji-Hoon Kim, Yeunju Choi, Tan Dat Nguyen, Seongkyu Mun, Joon Son Chung

TL;DR

AdaptVC tackles zero-shot voice conversion by disentangling linguistic content from speaker style, using adapters that tune self-supervised speech representations. The architecture combines a content encoder that produces a vector-quantized content representation with a speaker encoder whose frame-wise features condition a CFM decoder through cross-attention. The decoder is optimized with an OT-based loss, enabling high-fidelity synthesis. Training optimizes $\mathcal{L}_{commit}$, $\mathcal{L}_{prior}$, and $\mathcal{L}_{dec}$ to enforce discrete content, prior alignment, and efficient flow-based decoding: $\mathcal{L}_{total} = \mathcal{L}_{commit} + \mathcal{L}_{prior} + \mathcal{L}_{dec}$. Experimental results on LibriTTS and VCTK demonstrate superior naturalness and speaker similarity over existing models, with real-time performance enabled by the multi-step, cross-attention conditioned decoding and adaptive SSL feature integration.
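The decoder objective and the loss sum can be illustrated with a small sketch. It assumes the standard optimal-transport conditional flow matching formulation, which this summary names but does not spell out; the function names, the `SIGMA_MIN` value, and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
import random

SIGMA_MIN = 1e-4  # assumed small constant; the source gives no value


def cfm_training_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, u_t) on the straight OT path from noise x0 to data x1.

    x_t is the interpolated sample at time t in [0, 1];
    u_t is the target vector field the decoder would regress onto.
    """
    x_t = [(1.0 - (1.0 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def total_loss(l_commit, l_prior, l_dec):
    """Unweighted sum of the three terms, matching L_total in the TL;DR."""
    return l_commit + l_prior + l_dec


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                   # toy stand-in for a mel-spectrogram frame
x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
x_t, u_t = cfm_training_pair(x0, x1, t=0.3)

# A real decoder would predict the field from x_t, t, and the conditioning;
# an all-zero prediction stands in here so the loss can be computed.
l_dec = mse([0.0] * len(x1), u_t)
```

In this formulation the path starts at the noise sample when `t = 0` and ends within `SIGMA_MIN` of the data sample when `t = 1`, which is why sampling can follow a nearly straight trajectory.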

Abstract

The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and voice style from the reference. While existing approaches leverage various methods to isolate the two, generalization still requires further attention, especially for robustness in zero-shot scenarios. In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters. The adapters are trained to dynamically encode nuanced features from rich self-supervised features, and the decoder fuses them to produce speech that accurately resembles the reference with minimal loss of content. Moreover, we leverage a conditional flow matching decoder with cross-attention speaker conditioning to further boost the synthesis quality and efficiency. Subjective and objective evaluations in a zero-shot scenario demonstrate that the proposed method outperforms existing models in speech quality and similarity to the reference speech.

Paper Structure (17 sections, 4 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Overall architecture of AdaptVC. $\boldsymbol{h_{cont}}$ denotes the content representation from the adapter in the content encoder, and $\boldsymbol{h_{spk}}$ denotes the speaker features from the adapter in the speaker encoder. The prior distribution $\boldsymbol{\mu}$ is obtained by fusing the content and speaker information through cross-attention.
  • Figure 2: Illustration of the HuBERT adapter mechanism (a) and a decoder block in the CFM decoder (b). The residual input $\boldsymbol{\mu_{res}}$ is concatenated to $\boldsymbol{\mu}$ only for blocks with skip connections.
  • Figure 3: Visualization of adapter weights. Numbers on the x-axis indicate layer indices, and the y-axis denotes the trained weights.
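
The captions for Figures 2 and 3 suggest that each adapter holds one trained weight per HuBERT layer. A minimal sketch of such a layer-weighting adapter follows, assuming the per-layer features are combined by a weighted sum; the softmax normalization, the function names, and the toy shapes are assumptions made for illustration and are not confirmed by this summary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adapt_layers(layer_feats, layer_logits):
    """Combine per-layer SSL features into a single feature sequence.

    layer_feats:  list over layers; each entry is a list of frames,
                  and each frame is a list of floats.
    layer_logits: one learnable scalar per layer (hypothetical parameters).
    """
    weights = softmax(layer_logits)
    n_frames = len(layer_feats[0])
    dim = len(layer_feats[0][0])
    out = [[0.0] * dim for _ in range(n_frames)]
    for w, feats in zip(weights, layer_feats):
        for t in range(n_frames):
            for d in range(dim):
                out[t][d] += w * feats[t][d]
    return out


# Two toy "layers", each with two frames of two-dimensional features.
feats = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 2.0], [2.0, 3.0]],
]
h = adapt_layers(feats, layer_logits=[0.0, 0.0])  # equal logits give a plain average
```

Under this reading, the content and speaker encoders would each learn their own set of logits, so the two adapters can emphasize different layers of the same self-supervised model.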