DDSP: Differentiable Digital Signal Processing

Jesse Engel, Lamtharn Hantrakul, Chenjie Gu, Adam Roberts

TL;DR

DDSP introduces a differentiable DSP framework that integrates oscillators, envelopes, filters, and reverb into end-to-end trainable neural networks for audio synthesis. By leveraging interpretable DSP components, the approach achieves high-fidelity sound without autoregressive or adversarial losses and enables independent control over pitch, loudness, and timbre, as well as extrapolation and acoustic transfer. The paper demonstrates strong results on NSynth and solo violin data, including dereverberation and timbre transfer, with significantly smaller models than conventional neural synthesizers. The open-source DDSP library provides a modular toolkit to combine classical signal processing with modern deep learning, encouraging broader adoption and extension by the community.

Abstract

Most generative models of audio directly generate samples in one of two domains: time or frequency. While sufficient to express any signal, these representations are inefficient, as they do not utilize existing knowledge of how sound is generated and perceived. A third approach (vocoders/synthesizers) successfully incorporates strong domain knowledge of signal processing and perception, but has been less actively researched due to limited expressivity and difficulty integrating with modern auto-differentiation-based machine learning methods. In this paper, we introduce the Differentiable Digital Signal Processing (DDSP) library, which enables direct integration of classic signal processing elements with deep learning methods. Focusing on audio synthesis, we achieve high-fidelity generation without the need for large autoregressive models or adversarial losses, demonstrating that DDSP enables utilizing strong inductive biases without losing the expressive power of neural networks. Further, we show that combining interpretable modules permits manipulation of each separate model component, with applications such as independent control of pitch and loudness, realistic extrapolation to pitches not seen during training, blind dereverberation of room acoustics, transfer of extracted room acoustics to new environments, and transformation of timbre between disparate sources. In short, DDSP enables an interpretable and modular approach to generative modeling, without sacrificing the benefits of deep learning. The library is publicly available at https://github.com/magenta/ddsp and we welcome further contributions from the community and domain experts.
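
To make the idea of an interpretable synthesizer module concrete, here is a minimal NumPy sketch of a harmonic additive oscillator driven by the controls named in the figure captions: fundamental frequency, amplitude, and a distribution over harmonics. It is an illustration under stated assumptions, not the DDSP library's API: the function name `harmonic_synth`, the per-sample control shapes, and the 16 kHz default sample rate are all hypothetical. NumPy does not compute gradients, but every operation used here (cumulative sum, sine, multiply, sum) has a direct counterpart in auto-differentiation frameworks.

```python
import numpy as np


def harmonic_synth(f0_hz, amplitude, harmonic_distribution, sample_rate=16000):
    """Sum sinusoids at integer multiples of f0 (hypothetical helper).

    f0_hz:                 [n_samples] fundamental frequency in Hz.
    amplitude:             [n_samples] overall amplitude.
    harmonic_distribution: [n_samples, n_harmonics] non-negative weights.
    Returns audio of shape [n_samples].
    """
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    amplitude = np.asarray(amplitude, dtype=np.float64)
    dist = np.asarray(harmonic_distribution, dtype=np.float64)

    # Frequency of every harmonic at every sample: k * f0 for k = 1..K.
    harmonic_numbers = np.arange(1, dist.shape[1] + 1)
    freqs = f0_hz[:, None] * harmonic_numbers[None, :]

    # Silence harmonics at or above the Nyquist frequency to avoid aliasing,
    # then renormalize so the remaining weights sum to one per sample.
    dist = np.where(freqs < sample_rate / 2.0, dist, 0.0)
    dist = dist / np.maximum(dist.sum(axis=1, keepdims=True), 1e-8)

    # Instantaneous phase is the running sum of angular frequency per sample.
    phase = 2.0 * np.pi * np.cumsum(freqs / sample_rate, axis=0)

    return amplitude * np.sum(dist * np.sin(phase), axis=1)


# Usage: 0.1 s of a 440 Hz tone with ten equally weighted harmonics.
n = 1600
audio = harmonic_synth(np.full(n, 440.0), np.full(n, 0.5), np.ones((n, 10)))
```

Because the controls are explicit arrays, pitch and loudness can be changed independently of the harmonic weights, which is the kind of separate manipulation the abstract describes.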

Paper Structure

This paper contains 37 sections, 5 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Challenges of neural audio synthesis. Full description provided in the section on challenges.
  • Figure 2: Autoencoder architecture. Red components are part of the neural network architecture, green components are the latent representation, and yellow components are deterministic synthesizers and effects. Components with dashed borders are not used in all of our experiments. Namely, ${\bm{z}}$ is not used in the model trained on solo violin, and reverb is not used in the models trained on NSynth. See the appendix for more detailed diagrams of the neural network components.
  • Figure 3: Separate interpolations over loudness, pitch, and timbre. The conditioning features (solid lines) are extracted from two notes and linearly mixed (dark to light coloring). The features of the resynthesized audio (dashed lines) closely follow the conditioning. On the right, the latent vectors, $z(t)$, are interpolated, and the spectral centroid of the resulting audio (thin solid lines) varies smoothly between the original samples (dark solid lines).
  • Figure 4: Timbre transfer from singing voice to violin. F0 and loudness features are extracted from the voice and resynthesized with a DDSP autoencoder trained on solo violin.
  • Figure 5: Decomposition of a clip of solo violin. Audio is visualized with log magnitude spectrograms. Loudness and fundamental frequency signals are extracted from the original audio. The loudness curve does not exhibit clear note segmentations because of the effects of the room acoustics. The DDSP autoencoder takes those conditioning signals and predicts amplitudes, harmonic distributions, and noise magnitudes. Note that the amplitudes are clearly segmented along note boundaries without supervision and that the harmonic and noise distributions are complex and dynamic despite the simple conditioning signals. Finally, the extracted impulse response is applied to the combined audio from the synthesizers to give the full resynthesis audio.
  • ...and 4 more figures
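
The last step in the Figure 5 caption, applying an extracted impulse response to the combined synthesizer output, can be sketched as a convolution. The example below is a rough NumPy illustration, not the paper's implementation: the helper name `fft_convolve`, the toy signals standing in for the harmonic and noise components, and the decaying-noise impulse response are all assumptions made for the example.

```python
import numpy as np


def fft_convolve(audio, impulse_response):
    """Convolve audio with an impulse response via the FFT (hypothetical helper).

    The output is truncated to the length of the input audio.
    """
    audio = np.asarray(audio, dtype=np.float64)
    impulse_response = np.asarray(impulse_response, dtype=np.float64)

    # Zero-pad to a power of two that holds the full linear convolution,
    # so the circular convolution computed by the FFT does not wrap around.
    n_out = len(audio) + len(impulse_response) - 1
    n_fft = 1 << (n_out - 1).bit_length()

    spectrum = np.fft.rfft(audio, n_fft) * np.fft.rfft(impulse_response, n_fft)
    return np.fft.irfft(spectrum, n_fft)[: len(audio)]


# Toy stand-ins for the two synthesizer outputs (assumed, not from the paper).
rng = np.random.default_rng(0)
t = np.arange(1600) / 16000.0
harmonic = 0.5 * np.sin(2.0 * np.pi * 440.0 * t)
noise = 0.01 * rng.standard_normal(t.shape)

# Toy room response: a unit direct path followed by a decaying noisy tail.
tail = 0.1 * rng.standard_normal(400) * np.exp(-np.arange(400) / 80.0)
impulse_response = np.concatenate([[1.0], tail])

dry = harmonic + noise                       # combined synthesizer output
wet = fft_convolve(dry, impulse_response)    # full resynthesis with room acoustics
```

Keeping the room response as a separate module is what makes the acoustic applications in the abstract possible: using `dry` alone omits the room, and convolving `dry` with a different impulse response places the same performance in a new room.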