Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, Mohammad Norouzi
TL;DR
This work introduces a WaveNet-style autoencoder that learns temporal embeddings directly from raw audio and uses them to condition a powerful autoregressive decoder, so the decoder no longer depends on external conditioning signals for long-range structure. It also introduces NSynth, a large-scale dataset of ~306k four-second notes across ~1000 instruments, which enables analysis of embedding spaces and perceptual fidelity. The WaveNet autoencoder outperforms a strong spectral autoencoder baseline in reconstruction and timbre interpolation, and its embeddings support meaningful pitch-timbre morphing and extrapolation to longer contexts. Together, these contributions provide a practical benchmark and a scalable framework for high-quality neural audio synthesis with controllable timbre and dynamics.
Abstract
Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. In this paper, we offer contributions in both these areas to enable similar progress in audio modeling. First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform. Second, we introduce NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets. Using NSynth, we demonstrate improved qualitative and quantitative performance of the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally, we show that the model learns a manifold of embeddings that allows for morphing between instruments, meaningfully interpolating in timbre to create new types of sounds that are realistic and expressive.
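The instrument morphing described in the abstract can be illustrated with a minimal sketch, assuming that morphing amounts to blending two temporal codes element-wise before decoding. The function name `interpolate_embeddings`, the `[time_step][channel]` layout, and the toy values are hypothetical and chosen only for illustration; the encoder and the autoregressive decoder are not reproduced here.

```python
from typing import List

# Hypothetical layout for a temporal embedding: [time_step][channel].
Embedding = List[List[float]]


def interpolate_embeddings(z_a: Embedding, z_b: Embedding, alpha: float) -> Embedding:
    """Linearly blend two temporal embeddings of identical shape.

    alpha = 0.0 returns z_a unchanged, alpha = 1.0 returns z_b.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    same_length = len(z_a) == len(z_b)
    if not same_length or any(len(ra) != len(rb) for ra, rb in zip(z_a, z_b)):
        raise ValueError("embeddings must have the same shape")
    # Blend each channel at each time step independently.
    return [
        [(1.0 - alpha) * a + alpha * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(z_a, z_b)
    ]


# Toy codes standing in for the encoder outputs of two instruments
# (assumed values, two time steps by two channels).
z_flute = [[0.0, 1.0], [2.0, 3.0]]
z_organ = [[4.0, 5.0], [6.0, 7.0]]

# An even blend of the two codes; in a full system this blended code
# would condition the decoder in place of either original code.
z_mix = interpolate_embeddings(z_flute, z_organ, 0.5)
print(z_mix)
```

Sweeping `alpha` from 0 to 1 would trace a path between the two codes; whether every point on that path decodes to a realistic sound is an empirical question that this sketch does not address.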
