RAVE: A variational autoencoder for fast and high-quality neural audio synthesis
Antoine Caillon, Philippe Esling
TL;DR
RAVE addresses fast, high-quality neural audio synthesis by combining a variational autoencoder with a two-stage training procedure and a multiband waveform representation, which together allow 48 kHz audio to be generated faster than real time on a CPU. Stage 1 (representation learning) trains the model with a multiscale spectral loss under the ELBO objective. Stage 2 (adversarial fine-tuning) freezes the encoder and improves realism with a GAN objective and a feature-matching loss. The authors also introduce a post-training analysis of the latent space that uses SVD to identify the informative latent dimensions, together with a fidelity parameter $f$ that trades reconstruction fidelity against representation compactness and yields compact, manipulable latent codes. They report state-of-the-art perceptual quality, synthesis 20 times faster than real time on a CPU for 48 kHz signals, and applications to timbre transfer and data-driven signal compression. Source code and audio samples are publicly available.
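To make the idea of a multiscale spectral loss more concrete, here is a minimal NumPy sketch of one common form of it. It is not taken from the RAVE source code: the function names, the set of FFT sizes, the hop of a quarter window, and the particular combination of a normalized linear term with a log-magnitude term are all assumptions chosen for illustration.

```python
import numpy as np


def stft_magnitude(x, n_fft, hop):
    """Magnitude STFT of a 1-D signal using a Hann window and no padding."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=-1))


def multiscale_spectral_distance(x, y, scales=(2048, 1024, 512, 256, 128), eps=1e-7):
    """Sum of spectral distances between x and y over several STFT resolutions.

    Assumed form (illustrative only): at each scale, a Frobenius distance
    normalized by the reference spectrum plus a mean absolute difference
    of log magnitudes.
    """
    total = 0.0
    for n_fft in scales:
        sx = stft_magnitude(x, n_fft, n_fft // 4)
        sy = stft_magnitude(y, n_fft, n_fft // 4)
        linear_term = np.linalg.norm(sx - sy) / (np.linalg.norm(sx) + eps)
        log_term = np.mean(np.abs(np.log(sx + eps) - np.log(sy + eps)))
        total += linear_term + log_term
    return total


# Usage: compare a 440 Hz tone sampled at 48 kHz with a noisy copy of itself.
rng = np.random.default_rng(0)
t = np.arange(8192) / 48000
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * rng.standard_normal(len(t))
distance = multiscale_spectral_distance(clean, noisy)
```

Comparing spectra at several window sizes balances time and frequency resolution: long windows resolve pitch, short windows resolve transients. In an actual training loop the same computation would be written in a differentiable framework rather than NumPy.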
Abstract
Deep generative models applied to audio have improved the state of the art by a large margin in many speech- and music-related tasks. However, as raw waveform modelling remains an inherently difficult task, audio generative models are either computationally intensive, rely on low sampling rates, are complicated to control, or restrict the nature of possible signals. Among those models, Variational AutoEncoders (VAE) give control over the generation by exposing latent variables, although they usually suffer from low synthesis quality. In this paper, we introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis. We introduce a novel two-stage training procedure, namely representation learning and adversarial fine-tuning. We show that a post-training analysis of the latent space allows direct control over the trade-off between reconstruction fidelity and representation compactness. By leveraging a multi-band decomposition of the raw waveform, we show that our model is the first able to generate 48 kHz audio signals while simultaneously running 20 times faster than real time on a standard laptop CPU. We evaluate synthesis quality using both quantitative and qualitative subjective experiments and show the superiority of our approach compared to existing models. Finally, we present applications of our model for timbre transfer and signal compression. All of our source code and audio examples are publicly available.
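The post-training latent analysis can be pictured with the following NumPy sketch. It rests on an assumption not stated in this section: that the fidelity parameter $f$ is the fraction of total latent variance to retain, so the number of informative dimensions is the smallest $k$ whose leading singular directions reach that fraction. The function name `informative_dimensions` and the synthetic latent codes are hypothetical.

```python
import numpy as np


def informative_dimensions(latents, fidelity):
    """Return (basis, k) for latent codes of shape (n_samples, n_dims).

    `basis` holds the principal directions as rows, ordered by decreasing
    variance. `k` is the smallest number of directions whose cumulative
    variance fraction reaches `fidelity` (assumed meaning of f).
    """
    centered = latents - latents.mean(axis=0)
    _, singular_values, basis = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values**2
    cumulative = np.cumsum(variance) / variance.sum()
    # First index where the cumulative fraction is >= fidelity, as a count.
    k = int(np.searchsorted(cumulative, fidelity)) + 1
    # Guard against rounding error when fidelity is 1.0.
    k = min(k, len(singular_values))
    return basis, k


# Synthetic 16-dimensional latents: three dimensions carry almost all variance.
rng = np.random.default_rng(0)
scales = np.array([10.0, 5.0, 2.0] + [0.01] * 13)
latents = rng.standard_normal((2000, 16)) * scales

basis, k = informative_dimensions(latents, fidelity=0.99)
# Compact codes: project the centered latents onto the k leading directions.
compact = (latents - latents.mean(axis=0)) @ basis[:k].T
```

Under this reading, lowering $f$ discards more low-variance directions and gives a more compact representation at the cost of reconstruction fidelity, while raising $f$ keeps more dimensions.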
