A comparative study of generative models for child voice conversion
Protima Nomo Sudro, Anton Ragni, Thomas Hain
TL;DR
This work compares four generative approaches (GAN-based CycleGAN-VC2, VAE, flow-based, and diffusion models) for adult-to-child voice conversion using acted dubbing data. It introduces a frequency-warping post-processing step to reduce mismatch with target child speech and evaluates both objective metrics (MCD, F0 RMSE) and subjective MOS. The results show CycleGAN-VC2 and diffusion models deliver strong objective performance, with warping consistently improving similarity across models; diffusion benefits most from larger training data. The findings highlight the trade-offs between one-to-one mappings and multi-speaker capabilities for dubbing applications, guiding model selection based on data availability and target deployment.
Abstract
Generative models are a popular choice for adult-to-adult voice conversion (VC) because they model unlabelled data efficiently. Their usefulness for producing children's speech, and in particular for adult-to-child VC, has not yet been investigated. For adult-to-child VC, four generative models are compared: a diffusion model, a flow-based model, a variational autoencoder, and a generative adversarial network. Results show that although the converted speech produced by these models appears plausible, it exhibits insufficient similarity to the target speaker's characteristics. We introduce an efficient frequency warping technique that can be applied to the output of the models and that significantly reduces the mismatch between adult and child speech. The outputs of all models are evaluated using both objective and subjective measures. In particular, we compare specific speaker pairings using a unique corpus collected for dubbing of children's speech.
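The objective measures named in the TL;DR, mel-cepstral distortion (MCD) and F0 RMSE, can be illustrated with a minimal sketch. It uses the commonly used textbook definitions and is not taken from the paper: the function names are hypothetical, the two feature sequences are assumed to be already time-aligned (for example by dynamic time warping), the 0th cepstral coefficient is assumed to carry energy and is excluded, and unvoiced frames are assumed to be marked with F0 = 0.

```python
import numpy as np


def mel_cepstral_distortion(ref_mcep, conv_mcep):
    """Mean MCD in dB between two time-aligned mel-cepstral sequences.

    Both inputs have shape (frames, coefficients). Column 0 is assumed
    to be the energy term and is left out of the distance.
    """
    ref = np.asarray(ref_mcep, dtype=float)
    conv = np.asarray(conv_mcep, dtype=float)
    if ref.shape != conv.shape:
        raise ValueError("sequences must be time-aligned and equally shaped")
    diff = ref[:, 1:] - conv[:, 1:]
    # Per-frame distortion: (10 / ln 10) * sqrt(2 * sum_d diff_d^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))


def f0_rmse(ref_f0, conv_f0):
    """RMSE in Hz over frames that are voiced in both contours.

    Unvoiced frames are assumed to be encoded as F0 = 0.
    """
    ref = np.asarray(ref_f0, dtype=float)
    conv = np.asarray(conv_f0, dtype=float)
    if ref.shape != conv.shape:
        raise ValueError("contours must be time-aligned and equally shaped")
    voiced = (ref > 0) & (conv > 0)
    if not voiced.any():
        raise ValueError("no frames are voiced in both contours")
    return float(np.sqrt(np.mean((ref[voiced] - conv[voiced]) ** 2)))
```

Lower values indicate a closer match to the target child speech for both measures. Restricting F0 RMSE to frames voiced in both contours is one common convention; other choices (such as comparing log-F0) are equally plausible and would change the numbers.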
