DDSP: Differentiable Digital Signal Processing
Jesse Engel, Lamtharn Hantrakul, Chenjie Gu, Adam Roberts
TL;DR
DDSP introduces a differentiable DSP framework that integrates oscillators, envelopes, filters, and reverb into end-to-end trainable neural networks for audio synthesis. By building on interpretable DSP components, the approach achieves high-fidelity sound without large autoregressive models or adversarial losses. It also enables independent control over pitch, loudness, and timbre, extrapolation to pitches not seen during training, and transfer of room acoustics between recordings. The paper demonstrates strong results on NSynth and solo violin data, including dereverberation and timbre transfer, with significantly smaller models than conventional neural synthesizers. The open-source DDSP library provides a modular toolkit for combining classical signal processing with modern deep learning, encouraging broader adoption and extension by the community.
Abstract
Most generative models of audio directly generate samples in one of two domains: time or frequency. While sufficient to express any signal, these representations are inefficient, as they do not utilize existing knowledge of how sound is generated and perceived. A third approach (vocoders/synthesizers) successfully incorporates strong domain knowledge of signal processing and perception, but has been less actively researched due to limited expressivity and difficulty integrating with modern auto-differentiation-based machine learning methods. In this paper, we introduce the Differentiable Digital Signal Processing (DDSP) library, which enables direct integration of classic signal processing elements with deep learning methods. Focusing on audio synthesis, we achieve high-fidelity generation without the need for large autoregressive models or adversarial losses, demonstrating that DDSP enables utilizing strong inductive biases without losing the expressive power of neural networks. Further, we show that combining interpretable modules permits manipulation of each separate model component, with applications such as independent control of pitch and loudness, realistic extrapolation to pitches not seen during training, blind dereverberation of room acoustics, transfer of extracted room acoustics to new environments, and transformation of timbre between disparate sources. In short, DDSP enables an interpretable and modular approach to generative modeling, without sacrificing the benefits of deep learning. The library is publicly available at https://github.com/magenta/ddsp and we welcome further contributions from the community and domain experts.
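To make the idea of a trainable signal processing element concrete, the following is a minimal sketch of an additive harmonic oscillator whose amplitudes are fitted by gradient descent. It is an illustration only, not the DDSP library's API: the names `harmonic_synth` and `fit_amplitudes` are hypothetical, the gradient is derived by hand with NumPy rather than computed by an auto-differentiation framework, and the plain time-domain mean squared error is a simplification chosen for brevity rather than the paper's training objective.

```python
import numpy as np


def harmonic_synth(f0_hz, amplitudes, n_samples, sample_rate):
    """Sum of sinusoids at integer multiples of f0_hz (hypothetical helper)."""
    t = np.arange(n_samples) / sample_rate
    harmonics = np.arange(1, len(amplitudes) + 1)
    # One row per harmonic k: sin(2 * pi * k * f0 * t)
    basis = np.sin(2.0 * np.pi * f0_hz * harmonics[:, None] * t[None, :])
    audio = np.asarray(amplitudes) @ basis
    return audio, basis


def fit_amplitudes(target, f0_hz, n_harmonics, sample_rate, steps=50, lr=0.5):
    """Gradient descent on harmonic amplitudes under a mean squared error."""
    amps = np.zeros(n_harmonics)
    for _ in range(steps):
        audio, basis = harmonic_synth(f0_hz, amps, len(target), sample_rate)
        residual = audio - target
        # Hand-derived gradient of mean(residual ** 2) w.r.t. each amplitude;
        # an autodiff framework would produce this automatically.
        grad = 2.0 * (basis @ residual) / len(target)
        amps -= lr * grad
    return amps


SAMPLE_RATE = 8000  # assumed values, chosen only for the example
true_amps = np.array([0.6, 0.3, 0.1])
target, _ = harmonic_synth(100.0, true_amps, 800, SAMPLE_RATE)
fitted = fit_amplitudes(target, 100.0, 3, SAMPLE_RATE)
print(np.round(fitted, 3))
```

Because the synthesizer is built from smooth operations (sines, products, sums), the loss has a well-defined gradient with respect to its parameters, which is the property that would let a neural network predict those parameters and be trained end to end. The fitted amplitudes also remain directly interpretable as the strength of each harmonic.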
