VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics
Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, Shogo Seki
TL;DR
VoiceGrad is a non-parallel any-to-many voice conversion (VC) framework built on score-based generative modeling. It draws on denoising score matching (DSM), diffusion probabilistic models (DPMs), and annealed Langevin dynamics to iteratively transform source mel-spectrograms into target-speaker representations. The score network has a U-Net-style architecture and is conditioned on the target speaker and, optionally, on bottleneck features (BNFs) that carry linguistic information, so conversion works without parallel data. In the reported experiments, VoiceGrad matches or outperforms the baselines on both objective and subjective measures; the diffusion-based variant converges faster, and BNF conditioning improves intelligibility. The results suggest that score-based and diffusion methods are a flexible basis for customizable voice conversion.
Abstract
In this paper, we propose a non-parallel any-to-many voice conversion (VC) method termed VoiceGrad. Inspired by WaveGrad, a recently introduced waveform generation method, VoiceGrad is based upon the concepts of score matching and Langevin dynamics. It uses weighted denoising score matching to train a score approximator, a fully convolutional network with a U-Net structure designed to predict the gradient of the log density of the speech feature sequences of multiple speakers. It then performs VC by using annealed Langevin dynamics to iteratively update an input feature sequence towards the nearest stationary point of the target distribution, based on the trained score approximator network. Thanks to the nature of this concept, VoiceGrad enables any-to-many VC, a VC scenario in which the speaker of input speech can be arbitrary, and allows for non-parallel training, which requires no parallel utterances or transcriptions.
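The iterative update described in the abstract can be illustrated with a minimal NumPy sketch of annealed Langevin dynamics. This is not the authors' implementation. The trained U-Net score approximator is replaced by the exact score of a toy Gaussian "target speaker" distribution. The names `toy_score` and `annealed_langevin`, the noise-level range, the step-size rule `alpha = eps * (sigma / sigma_min)**2`, and all hyperparameter values are assumptions taken from common practice in score-based generative modeling, not from the paper. As in the abstract, the chain starts from the source feature sequence rather than from pure noise.

```python
import numpy as np

def toy_score(x, sigma, mu, s=0.5):
    """Exact score of N(mu, s^2 I) after adding Gaussian noise of std sigma.

    Hypothetical stand-in for a trained score network, which in VoiceGrad
    would additionally be conditioned on the target speaker.
    """
    return -(x - mu) / (s ** 2 + sigma ** 2)

def annealed_langevin(x0, score_fn, sigmas, eps=2e-5, steps_per_level=100,
                      add_noise=True, seed=0):
    """Run Langevin updates while annealing the noise level from large to small."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    sigma_min = sigmas[-1]
    for sigma in sigmas:
        # Step size shrinks with the noise level (assumed schedule).
        alpha = eps * (sigma / sigma_min) ** 2
        for _ in range(steps_per_level):
            z = rng.standard_normal(x.shape) if add_noise else 0.0
            x = x + 0.5 * alpha * score_fn(x, sigma) + np.sqrt(alpha) * z
    return x

# Toy "feature sequences": 4 feature bins x 6 frames (illustrative sizes only).
target_mean = np.linspace(-2.0, 2.0, 24).reshape(4, 6)
source = np.zeros((4, 6))                 # stands in for the input speaker's features
sigmas = np.geomspace(1.0, 0.01, 10)      # noise levels, largest first

def score_fn(x, sigma):
    return toy_score(x, sigma, target_mean)

# Noise-free run: follows the score to the stationary point of the target density.
converted_det = annealed_langevin(source, score_fn, sigmas, add_noise=False)
# Stochastic run: a sample near the target distribution.
converted = annealed_langevin(source, score_fn, sigmas, add_noise=True)
```

With the injected noise switched off, the update reduces to gradient ascent on the log density, so the toy run ends at the mode of the Gaussian target. With noise enabled, the result is a sample scattered around that mode.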
