Towards the Synthesis of Non-speech Vocalizations
Enjamamul Hoq, Ifeoma Nwogu
TL;DR
This work tackles the unconditional synthesis of non-speech vocalizations, focusing on infant cries, by training DiffWave, a diffusion probabilistic model, on two infant-cry datasets. The model learns a reverse denoising process that turns Gaussian noise into audio. It is trained by optimizing a variational lower bound (ELBO), which reduces to predicting the noise added at each diffusion step, and it uses 200 diffusion steps, with the network conditioned on the step index through a sinusoidal embedding. Key contributions include a dataset-driven analysis for infant cries, a detailed description of the non-autoregressive architecture with bidirectional dilated convolutions, and practical fast-sampling techniques that reduce the number of reverse steps to as few as 6 while maintaining quality. The results demonstrate high-fidelity, diverse cry generation at 16 kHz, offering a data-augmentation path for pediatric acoustic research and a potential route to privacy-preserving synthetic data generation.
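To make the training objective concrete, the following is a minimal NumPy sketch of the noise-prediction loss and a sinusoidal diffusion-step embedding. It is an illustration under stated assumptions, not the authors' implementation: the linear beta schedule and its endpoints, the 128-dimensional embedding, and the names `step_embedding`, `diffuse`, `noise_prediction_loss`, and `predict_noise` are all hypothetical choices made for this example. Only the 200 diffusion steps and the 16 kHz sampling rate come from the text.

```python
import numpy as np

T = 200  # number of diffusion steps stated in the text

# Assumption: a linear beta schedule with illustrative endpoints.
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)  # cumulative signal-retention factors


def step_embedding(t, dim=128):
    """Sinusoidal embedding of the diffusion step t (dim is an assumption)."""
    half = dim // 2
    freqs = 10.0 ** (np.arange(half) * 4.0 / (half - 1))
    return np.concatenate([np.sin(freqs * t), np.cos(freqs * t)])


def diffuse(x0, t, rng):
    """Sample x_t from q(x_t | x_0) and return it with the noise that was used."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps


def noise_prediction_loss(predict_noise, x0, rng):
    """Simplified objective: mean squared error between true and predicted noise."""
    t = int(rng.integers(0, T))  # uniformly sampled step index
    x_t, eps = diffuse(x0, t, rng)
    eps_hat = predict_noise(x_t, step_embedding(t))
    return float(np.mean((eps - eps_hat) ** 2))


# Usage: one second of silence at 16 kHz and a placeholder predictor
# that always outputs zeros (a stand-in for the real network).
rng = np.random.default_rng(0)
x0 = np.zeros(16000)
loss = noise_prediction_loss(lambda x_t, emb: np.zeros_like(x_t), x0, rng)
```

In a real training loop, `predict_noise` would be the DiffWave network and the loss would be minimized with a gradient-based optimizer. The zero predictor here only shows how the pieces fit together.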
Abstract
In this report, we study the unconditional generation of infant cry sounds using the DiffWave framework, which has shown great promise in generating high-quality audio from noise. We train DiffWave on two distinct datasets of infant cries, the Baby Chillanto dataset and the deBarbaro cry dataset, to generate new cry sounds that maintain high fidelity and diversity. Our emphasis is on how well DiffWave handles the unconditional generation task.
