GANSynth: Adversarial Neural Audio Synthesis

Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, Adam Roberts

TL;DR

This work addresses high-fidelity, coherent audio synthesis by combining Generative Adversarial Networks (GANs) with spectral-domain representations. By generating log-magnitude spectrograms together with either phase or instantaneous frequency (IF), using sufficient frequency resolution, and conditioning on pitch, the authors show that GANs can outperform a strong WaveNet baseline on NSynth while generating audio orders of magnitude faster. Key contributions include identifying which spectral representations work well for GANs, showing better perceptual and diversity metrics for IF-based and high-resolution spectrograms, and achieving generation speeds fast enough to suggest real-time, on-device synthesis. The findings point toward spectral GANs as a practical route to scalable, controllable audio synthesis, with potential impact on music production, real-time sound design, and embedded audio applications.

Abstract

Efficient audio synthesis is an inherently difficult machine learning task, as human perception is sensitive to both global structure and fine-scale waveform coherence. Autoregressive models, such as WaveNet, model local structure but lack global latent structure and rely on slow iterative sampling, while Generative Adversarial Networks (GANs) have global latent conditioning and efficient parallel sampling but struggle to generate locally-coherent audio waveforms. Herein, we demonstrate that GANs can in fact generate high-fidelity and locally-coherent audio by modeling log magnitudes and instantaneous frequencies with sufficient frequency resolution in the spectral domain. Through extensive empirical investigations on the NSynth dataset, we demonstrate that GANs are able to outperform strong WaveNet baselines on automated and human evaluation metrics, and efficiently generate audio several orders of magnitude faster than their autoregressive counterparts.
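
The representation named in the abstract, log magnitudes plus instantaneous frequencies, can be illustrated with a small NumPy sketch. It follows the recipe given in the Figure 1 caption: unwrap the STFT phase over the 2$\pi$ boundary, then take its derivative along time. The frame size of 1024, the stride of 256, the 440 Hz test tone, and the helper names `stft` and `log_magnitude_and_if` are assumptions chosen for illustration; they are not the authors' settings or code.

```python
import numpy as np

def stft(signal, frame_size=1024, stride=256):
    """Naive Hann-windowed STFT; returns a complex array of shape (frames, bins)."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // stride
    frames = np.stack([
        signal[i * stride : i * stride + frame_size] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)

def log_magnitude_and_if(spec, eps=1e-6):
    """Split a complex STFT into log magnitude and instantaneous frequency."""
    log_mag = np.log(np.abs(spec) + eps)
    phase = np.angle(spec)                   # wrapped phase in (-pi, pi]
    unwrapped = np.unwrap(phase, axis=0)     # remove 2*pi jumps along the time axis
    inst_freq = np.diff(unwrapped, axis=0)   # finite difference: radians per frame
    return log_mag, inst_freq

# Illustrative input: one second of a steady 440 Hz tone at 16 kHz (assumed values).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

spec = stft(tone)
log_mag, inst_freq = log_magnitude_and_if(spec)

# Compare the measured phase advance at the loudest bin with the advance
# implied by the tone frequency and the stride (taken modulo 2*pi).
peak_bin = int(np.argmax(log_mag.mean(axis=0)))
expected = (2 * np.pi * 440.0 * 256 / sample_rate) % (2 * np.pi)
print(peak_bin, inst_freq[:, peak_bin].mean(), expected)
```

For a steady tone, the instantaneous frequency at the peak bin stays nearly constant from frame to frame, while the raw phase keeps wrapping. That constancy is the property the Figure 1 caption points to.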

Paper Structure

This paper contains 20 sections, 1 equation, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Frame-based estimation of audio waveforms. Much of sound is made up of locally-coherent waves with a local periodicity, pictured as the red-yellow sinusoid with black dots at the start of each cycle. Frame-based techniques, whether they be transposed convolutions or STFTs, have a given frame size and stride, here depicted as equal with boundaries at the dotted lines. The alignment between the two (phase, indicated by the solid black line and yellow boxes), precesses in time since the periodicity of the audio and the output stride are not exactly the same. Transposed convolutional filters thus have the difficult task of covering all the necessary frequencies and all possible phase alignments to preserve phase coherence. For an STFT, we can unwrap the phase over the 2$\pi$ boundary (orange boxes) and take its derivative to get the instantaneous radial frequency (red boxes), which expresses the constant relationship between audio frequency and frame frequency. The spectra are shown for an example trumpet note from the NSynth dataset.
  • Figure 2: Number of wins on pair-wise comparison across different output representations and baselines. Ablation comparing highest performing models of each type. Higher scores represent better perceptual quality to participants. The ranking observed here correlates well with the evaluation on quantitative metrics in Table \ref{table:metrics}.
  • Figure 3: Phase coherence. Examples are selected to be roughly similar between the models for illustrative purposes. The top row shows the waveform modulo the fundamental periodicity of the note (MIDI C60), for 1028 examples taken in the middle of the note. Notice that the real data completely overlaps itself as the waveform is extremely periodic. The WaveGAN and PhaseGAN, however, have many phase irregularities, creating a blurry web of lines. The IFGAN is much more coherent, having only small variations from cycle-to-cycle. In the Rainbowgrams below, the real data and IF models have coherent waveforms that result in strong consistent colors for each harmonic, while the PhaseGAN has many speckles due to phase discontinuities, and the WaveGAN model is quite irregular.
  • Figure 4: Global interpolation. Examples available for listening. Interpolating between waveforms perceptually results in crossfading the volumes of two distinct sounds (rainbowgrams at top). The WaveNet autoencoder (middle) only has local conditioning distributed in time, and no compact prior over those time series, so linear interpolation ventures off the true prior / data manifold and produces in-between sounds that are less realistic and feature the default failure mode of autoregressive WaveNets (feedback harmonics). Meanwhile, the IF-Mel GAN (bottom) has global conditioning, so interpolation moves through perceptual attributes while staying along the prior at all intermediate points, producing high-fidelity audio examples like the endpoints.
  • Figure 5: NDB bin proportions for the IF-Mel + H model and the WaveGAN baseline (evaluated with examples of pitch 60).
  • ...and 2 more figures
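
The Figure 3 caption describes plotting the waveform modulo the fundamental periodicity of the note, so that a perfectly periodic signal overlaps itself. A minimal NumPy sketch of that folding is given here, assuming a 16 kHz sample rate and MIDI pitch 60. The helper names, the synthetic test signals, and the `fold_spread` statistic are hypothetical and introduced only for illustration; the paper presents this comparison visually and `fold_spread` is not one of its metrics.

```python
import numpy as np

def midi_to_hz(midi_note):
    """Equal-tempered pitch with A4 (MIDI 69) at 440 Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)

def fold_by_period(wave, sample_rate, f0_hz):
    """Map every sample to its position within one fundamental cycle, in [0, 1)."""
    n = np.arange(len(wave))
    position = (n * f0_hz / sample_rate) % 1.0
    return position, wave

def fold_spread(wave, sample_rate, f0_hz, n_bins=64):
    """Hypothetical summary statistic: mean amplitude spread per cycle-position bin.

    A waveform that overlaps itself when folded gives a small value;
    phase irregularities smear the folded plot and raise it.
    """
    position, amp = fold_by_period(wave, sample_rate, f0_hz)
    bins = np.minimum((position * n_bins).astype(int), n_bins - 1)
    spreads = [amp[bins == b].std() for b in range(n_bins) if np.any(bins == b)]
    return float(np.mean(spreads))

# Synthetic stand-ins (assumptions): a phase-coherent note and one with phase drift.
sample_rate = 16000
f0 = midi_to_hz(60)
n = np.arange(sample_rate)
clean_phase = 2 * np.pi * f0 * n / sample_rate

rng = np.random.default_rng(0)
drift = np.cumsum(rng.normal(0.0, 0.1, size=n.size))  # random-walk phase error

coherent = np.sin(clean_phase) + 0.5 * np.sin(2 * clean_phase)
drifting = np.sin(clean_phase + drift) + 0.5 * np.sin(2 * (clean_phase + drift))

coherent_spread = fold_spread(coherent, sample_rate, f0)
drifting_spread = fold_spread(drifting, sample_rate, f0)
print(coherent_spread, drifting_spread)
```

Plotting `position` against `wave` for each signal gives the kind of folded view the caption describes: the coherent signal traces a single curve, while the drifting one spreads into a band.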