Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, Mohammad Norouzi

TL;DR

This work addresses neural audio synthesis by introducing a WaveNet-style autoencoder that learns temporal embeddings from raw audio to condition a powerful autoregressive decoder, eliminating the need for external long-range conditioning. It introduces NSynth, a large-scale dataset of ~306k four-second notes across ~1000 instruments, enabling analysis of embedding spaces and perceptual fidelity. The WaveNet autoencoder outperforms a strong spectral autoencoder baseline in reconstruction and timbre interpolation, and its embeddings support meaningful pitch-timbre morphing and extrapolation to longer contexts. Together, these contributions establish a practical benchmark and a scalable framework for high-quality neural audio synthesis with controllable timbre and dynamics.

Abstract

Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. In this paper, we offer contributions in both these areas to enable similar progress in audio modeling. First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform. Second, we introduce NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets. Using NSynth, we demonstrate improved qualitative and quantitative performance of the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally, we show that the model learns a manifold of embeddings that allows for morphing between instruments, meaningfully interpolating in timbre to create new types of sounds that are realistic and expressive.

Paper Structure

This paper contains 21 sections, 2 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Models considered in this paper. For both models, we optionally condition on pitch by concatenating the hidden embedding with a one-hot pitch representation. 1a. Baseline spectral autoencoder: Each block represents a nonlinear 2-D convolution with stride ($s$), kernel size ($k$), and channels (#). 1b. The WaveNet autoencoder: Downsampling in the encoder occurs only in the average pooling layer. The embeddings are distributed in time and upsampled with nearest neighbor interpolation to the original resolution before biasing each layer of the decoder. 'NC' indicates non-causal convolution. '1x1' indicates a 1-D convolution with kernel size 1. See Section \ref{sec:WaveNetAutoencoder} for further details.
  • Figure 2: Reconstructions of notes from three different instruments. Each note is displayed as a "Rainbowgram", a CQT spectrogram with intensity of lines proportional to the log magnitude of the power spectrum and color given by the derivative of the phase. Time is on the horizontal axis and frequency on the vertical axis. See Section \ref{sec:Reconstruction} for details. (Listen: Glockenspiel (https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure2_Reconstruction/Originals/Glockenspiel.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure2_Reconstruction/WaveNet/Glockenspiel.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure2_Reconstruction/Baseline/Glockenspiel.mp3), Electric Piano (https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure2_Reconstruction/Originals/ElectricPiano.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure2_Reconstruction/WaveNet/ElectricPiano.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure2_Reconstruction/Baseline/ElectricPiano.mp3), Flugelhorn (https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure2_Reconstruction/Originals/Flugelhorn.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure2_Reconstruction/WaveNet/Flugelhorn.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure2_Reconstruction/Baseline/Flugelhorn.mp3))
  • Figure 3: Rainbowgrams of linear interpolations between three different notes from instruments in the holdout set. For the original rainbowgrams, the raw audio is linearly mixed. For the models, samples are generated from linear interpolations in embedding space. See Section \ref{sec:InterpolateZ} for details. (Listen: Original (https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/Original/0_Bass.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/Original/1_Bass+Flute.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/Original/2_Flute.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/Original/3_Flute+Organ.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/Original/4_Organ.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/Original/5_Organ+Bass.mp3), WaveNet (https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/WaveNet/0_Bass.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/WaveNet/1_Bass+Flute.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/WaveNet/2_Flute.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/WaveNet/3_Flute+Organ.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/WaveNet/4_Organ.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/WaveNet/5_Organ+Bass.mp3), Baseline (https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/Baseline/0_Bass.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/Baseline/1_Bass+Flute.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/Baseline/2_Flute.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/Baseline/3_Flute+Organ.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/Baseline/4_Organ.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure3_Interpolation/Baseline/5_Organ+Bass.mp3))
  • Figure 4: Conditioning on pitch. These rainbowgrams are reconstructions of a single electric piano note from the holdout set. They were synthesized with the baseline model (128 hidden dimensions). By holding $Z$ constant and conditioning on different pitches, we can play two octaves of a C major chord from a single embedding. The original pitch (MIDI C60) is dashed in white for comparison. See Section \ref{sec:PitchShift} for details. (Listen: https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure4_Pitch/Baseline/0_pitch_-12.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure4_Pitch/Baseline/1_pitch_-8.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure4_Pitch/Baseline/2_pitch_-5.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure4_Pitch/Baseline/3_pitch_0.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure4_Pitch/Baseline/4_pitch_+4.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure4_Pitch/Baseline/5_pitch_+7.mp3, https://download.magenta.tensorflow.org/audio_examples/nsynth/Figure4_Pitch/Baseline/6_pitch_+12.mp3)
  • Figure 5: Correlation of embeddings across pitch for three different instruments and the average across all instruments. These embeddings were taken from a WaveNet model trained without pitch conditioning.
  • ...and 5 more figures
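
The figure captions above mention three operations on embeddings: linear interpolation in embedding space (Figure 3), concatenation with a one-hot pitch representation (Figures 1 and 4), and nearest neighbor upsampling to the original time resolution (Figure 1). The sketch below illustrates them in NumPy under stated assumptions; it is not the authors' implementation. The function names, the embedding shape (4 frames by 16 channels), the hop of 512 samples, and the 128-entry pitch vocabulary are all hypothetical values chosen for illustration, and the order in which pitch is appended and the embedding is upsampled is likewise an assumption that the captions do not specify.

```python
import numpy as np


def interpolate(z_a, z_b, alpha):
    """Linear interpolation between two embeddings of identical shape."""
    return (1.0 - alpha) * z_a + alpha * z_b


def append_one_hot_pitch(z, pitch, num_pitches=128):
    """Concatenate a one-hot pitch vector to every time step of z.

    z has shape (frames, channels); num_pitches=128 is an assumed
    MIDI-style vocabulary size, not a value taken from the text.
    """
    one_hot = np.zeros(num_pitches, dtype=z.dtype)
    one_hot[pitch] = 1.0
    tiled = np.tile(one_hot, (z.shape[0], 1))
    return np.concatenate([z, tiled], axis=-1)


def upsample_nearest(z, hop):
    """Nearest neighbor upsampling: repeat each frame hop times in time."""
    return np.repeat(z, hop, axis=0)


# Toy embeddings standing in for two encoded notes (random, for shape only).
rng = np.random.default_rng(0)
z_bass = rng.standard_normal((4, 16)).astype(np.float32)
z_flute = rng.standard_normal((4, 16)).astype(np.float32)

z_mix = interpolate(z_bass, z_flute, 0.5)    # halfway between the two notes
z_cond = append_one_hot_pitch(z_mix, pitch=60)  # condition on MIDI pitch 60
z_up = upsample_nearest(z_cond, hop=512)     # one conditioning vector per sample
print(z_up.shape)  # (2048, 144)
```

In this sketch every audio sample within a 512-sample hop receives the same conditioning vector, which is what nearest neighbor upsampling means here; holding `z_mix` fixed and changing only `pitch` mirrors the idea in Figure 4 of playing different pitches from a single embedding.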