Synthesizer Sound Matching Using Audio Spectrogram Transformers
Fred Bruford, Frederik Blang, Shahan Nercessian
TL;DR
The paper tackles synthesizer sound matching by inferring $16$ continuous synthesis parameters from audio using a general, architecture-agnostic approach. It introduces an Audio Spectrogram Transformer (AST) as a regression backbone, trained on a large synthetic dataset generated with the Massive synthesizer; the model takes a $64$-bin Mel spectrogram as input, predicts parameters through a $3$-layer MLP head, and is optimized with a mean squared error loss $L_{MSE}$. The AST outperforms 5-layer MLP and 5-layer CNN baselines on both parameter reconstruction and audio fidelity, the latter measured by Spectral Convergence (SC), and shows promising out-of-domain generalization to vocal imitations and other instruments. The work demonstrates that transformer-based, general-purpose synthesizer sound matching is viable without differentiable synthesizers, offering a scalable path toward a wider range of sound sources and control tasks.
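The two quantities named in this summary, the parameter-space loss $L_{MSE}$ and Spectral Convergence, can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the commonly used definition of spectral convergence (the Frobenius norm of the difference between two magnitude spectrograms, divided by the Frobenius norm of the target), and the function names, frame length, and hop size are hypothetical choices made only for this example.

```python
import numpy as np


def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Magnitude STFT from Hann-windowed frames (frame_len and hop are illustrative)."""
    signal = np.asarray(signal, dtype=np.float64)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One row per frame, one column per frequency bin.
    return np.abs(np.fft.rfft(frames, axis=1))


def spectral_convergence(target_mag, estimate_mag, eps=1e-12):
    """Assumed definition: ||target - estimate||_F / ||target||_F."""
    numerator = np.linalg.norm(target_mag - estimate_mag)  # Frobenius norm for 2-D input
    denominator = np.linalg.norm(target_mag)
    return numerator / (denominator + eps)


def parameter_mse(true_params, predicted_params):
    """Mean squared error between true and predicted synthesis parameters."""
    true_params = np.asarray(true_params, dtype=np.float64)
    predicted_params = np.asarray(predicted_params, dtype=np.float64)
    return float(np.mean((true_params - predicted_params) ** 2))


if __name__ == "__main__":
    sample_rate = 16000
    t = np.arange(sample_rate) / sample_rate
    target = np.sin(2 * np.pi * 440.0 * t)    # stand-in for the input sound
    estimate = np.sin(2 * np.pi * 445.0 * t)  # stand-in for the re-synthesized match

    sc = spectral_convergence(
        magnitude_spectrogram(target), magnitude_spectrogram(estimate)
    )
    print("spectral convergence:", sc)

    # 16 parameters, matching the parameter count in the summary; values are random.
    rng = np.random.default_rng(0)
    true_p = rng.uniform(size=16)
    pred_p = np.clip(true_p + rng.normal(scale=0.05, size=16), 0.0, 1.0)
    print("parameter MSE:", parameter_mse(true_p, pred_p))
```

Under this definition a score of 0 means the two magnitude spectrograms are identical, and an all-zero estimate scores approximately 1. Lower values therefore indicate a closer audio match, independently of how close the predicted parameters are to the true ones.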
Abstract
Systems for synthesizer sound matching, which automatically set the parameters of a synthesizer to emulate an input sound, have the potential to make the process of synthesizer programming faster and easier for novice and experienced musicians alike, whilst also affording new means of interaction with synthesizers. Considering the enormous variety of synthesizers in the marketplace, and the complexity of many of them, general-purpose sound matching systems that function with minimal knowledge or prior assumptions about the underlying synthesis architecture are particularly desirable. With this in mind, we introduce a synthesizer sound matching model based on the Audio Spectrogram Transformer. We demonstrate the viability of this model by training on a large synthetic dataset of randomly generated samples from the popular Massive synthesizer. We show that this model can reconstruct parameters of samples generated from a set of 16 parameters, highlighting its improved fidelity relative to multi-layer perceptron and convolutional neural network baselines. We also provide audio examples demonstrating the out-of-domain model performance in emulating vocal imitations, and sounds from other synthesizers and musical instruments.
