FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation

Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo

TL;DR

FastVoiceGrad tackles the slow inference of diffusion-based voice conversion (VC) by learning a one-step reverse diffusion model via adversarial conditional diffusion distillation (ACDD). It revisits sampling initialization, starting from a diffused source state $\boldsymbol{x}_{S_K}^{\mathrm{src}}$ at a reduced start step $S_K$ so that source content is better preserved while single-step generation remains possible. Training combines adversarial losses in the waveform domain, feature matching, and teacher-guided score distillation to obtain a one-step student whose VC quality is competitive with multi-step baselines at substantially lower inference cost. Results on VCTK and LibriTTS show that FastVoiceGrad matches or exceeds several diffusion-based VC methods while running fast enough to be a practical candidate for real-time, any-to-any VC.
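
The diffused source state mentioned above can be illustrated with the standard closed-form forward diffusion, $\boldsymbol{x}_{S_K}^{\mathrm{src}} = \sqrt{\bar{\alpha}_{S_K}}\,\boldsymbol{x}_0^{\mathrm{src}} + \sqrt{1-\bar{\alpha}_{S_K}}\,\boldsymbol{\epsilon}$. The following is a minimal sketch, not the authors' implementation: the linear noise schedule, the total of 1000 steps, the feature shape, and the names `make_alpha_bar` and `diffuse_source` are assumptions made for illustration only.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal coefficients alpha_bar_t for t = 1..num_steps.

    Assumption: a linear beta schedule; the paper's schedule may differ.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def diffuse_source(x0_src, start_step, alpha_bar, rng):
    """Sample x_{S_K} from a clean source feature x_0 in one shot.

    start_step plays the role of S_K (1-indexed); a smaller value keeps
    more of the source signal, a larger value injects more noise.
    """
    a = alpha_bar[start_step - 1]
    noise = rng.standard_normal(x0_src.shape)
    return np.sqrt(a) * x0_src + np.sqrt(1.0 - a) * noise

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = make_alpha_bar()
    # Hypothetical mel-spectrogram-like feature: 80 bins x 128 frames.
    x0 = rng.standard_normal((80, 128))
    for s_k in (250, 500, 1000):
        x_sk = diffuse_source(x0, s_k, alpha_bar, rng)
        print(s_k, float(np.sqrt(alpha_bar[s_k - 1])), x_sk.shape)
```

The printed signal coefficient shrinks as the start step grows, which is the intuition behind choosing a reduced $S_K$: less of the source content is overwritten by noise before the single reverse step.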

Abstract

Diffusion-based voice conversion (VC) techniques such as VoiceGrad have attracted interest because of their high VC performance in terms of speech quality and speaker similarity. However, a notable limitation is the slow inference caused by the multi-step reverse diffusion. Therefore, we propose FastVoiceGrad, a novel one-step diffusion-based VC that reduces the number of iterations from dozens to one while inheriting the high VC performance of the multi-step diffusion-based VC. We obtain the model using adversarial conditional diffusion distillation (ACDD), leveraging the ability of generative adversarial networks and diffusion models while reconsidering the initial states in sampling. Evaluations of one-shot any-to-any VC demonstrate that FastVoiceGrad achieves VC performance superior to or comparable to that of previous multi-step diffusion-based VC while enhancing the inference speed. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/fastvoicegrad/.
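
To make the "dozens of iterations versus one" contrast concrete, the sketch below counts network evaluations for a deterministic DDIM-style multi-step loop and for a single-step estimate of the clean feature. It is a toy under stated assumptions: the linear schedule, the step grid, and the `OracleNoisePredictor` stand-in (which simply returns the true noise instead of running a trained network) are all hypothetical and are not taken from the paper.

```python
import numpy as np

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Assumption: linear beta schedule, used only for illustration.
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def estimate_x0(x_t, eps_hat, a_t):
    """Invert x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps for x_0."""
    return (x_t - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)

def reverse_diffusion(x_start, steps, alpha_bar, predict_noise):
    """Deterministic DDIM-style reverse pass over a non-empty list of
    descending 1-indexed steps. With a single step it reduces to one
    network call followed by estimate_x0."""
    x = x_start
    for i, t in enumerate(steps):
        a_t = alpha_bar[t - 1]
        eps_hat = predict_noise(x, t)
        x0_hat = estimate_x0(x, eps_hat, a_t)
        if i + 1 == len(steps):
            return x0_hat
        a_next = alpha_bar[steps[i + 1] - 1]
        x = np.sqrt(a_next) * x0_hat + np.sqrt(1.0 - a_next) * eps_hat

class OracleNoisePredictor:
    """Hypothetical stand-in for a trained network: it knows x_0 and
    returns the exact noise, while counting how often it is called."""
    def __init__(self, x0, alpha_bar):
        self.x0, self.alpha_bar, self.calls = x0, alpha_bar, 0

    def __call__(self, x_t, t):
        self.calls += 1
        a_t = self.alpha_bar[t - 1]
        return (x_t - np.sqrt(a_t) * self.x0) / np.sqrt(1.0 - a_t)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = alpha_bar_schedule()
    x0 = rng.standard_normal((80, 32))
    start = 1000
    a = alpha_bar[start - 1]
    x_start = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.standard_normal(x0.shape)

    multi = OracleNoisePredictor(x0, alpha_bar)
    reverse_diffusion(x_start, list(range(start, 0, -33)), alpha_bar, multi)
    single = OracleNoisePredictor(x0, alpha_bar)
    reverse_diffusion(x_start, [start], alpha_bar, single)
    print("network calls:", multi.calls, "vs", single.calls)
```

With an oracle predictor both paths recover the clean feature, so the only difference is the number of calls. A real one-step student has to reach comparable quality with an imperfect predictor, which is what the distillation is for.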

Paper Structure

This paper contains 11 sections, 13 equations, 2 figures, 2 tables, and 1 algorithm.

Figures (2)

  • Figure 1: Comparison between (a) a typical multi-step diffusion-based VC (e.g., VoiceGrad (Kameoka et al., 2024)) and (b) the proposed one-step diffusion-based VC (FastVoiceGrad). FastVoiceGrad reduces the required number of iterations from dozens to one and improves the inference speed (e.g., by a factor of 30 in this example).
  • Figure 2: Relationship between DNSMOS and $S_K$, and between SVA and $S_K$. The clean source $\bm{x}_0^{\mathrm{src}}$ (blue line) and the diffused source $\bm{x}_{S_K}^{\mathrm{src}}$ (orange line) were used as the initial values of $\bm{x}$. Scores were calculated for $S_K$ sampled every 50 steps.