Synthesizer Sound Matching Using Audio Spectrogram Transformers

Fred Bruford; Frederik Blang; Shahan Nercessian

TL;DR

The paper tackles synthesizer sound matching by inferring $16$ continuous synthesis parameters from audio using a general, architecture-agnostic approach. It uses an Audio Spectrogram Transformer (AST) as a regression backbone: a $64$-bin Mel spectrogram is fed to the AST, a $3$-layer MLP head predicts the parameters, and the model is trained with $L_{MSE}$ on a large synthetic dataset generated from the Massive synthesizer. The AST outperforms 5-layer MLP and 5-layer CNN baselines on both parameter reconstruction and audio fidelity, the latter measured by Spectral Convergence (SC), and shows promising out-of-domain generalization to vocal imitations and other instruments. The work demonstrates the viability of transformer-based, general-purpose synthesizer sound matching without differentiable synthesizers, offering a scalable path toward broader applicability across diverse sonic sources and control tasks.
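The two quantities named above, the MSE on parameters and Spectral Convergence on spectrograms, can be illustrated with a minimal NumPy sketch. This is not the authors' code. The SC form used here (Frobenius norm of the spectrogram difference divided by the norm of the target) is a commonly used definition and is assumed rather than taken from the paper. The function names, the plain STFT magnitude (instead of a Mel spectrogram), and the frame and hop sizes are all illustrative choices.

```python
import numpy as np


def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Magnitude STFT with a Hann window, shape (frames, frequency bins).

    Assumption: a plain linear-frequency STFT stands in for the Mel
    spectrogram used in the paper, to keep the sketch dependency-free.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def parameter_mse(pred_params, true_params):
    """Mean squared error between predicted and ground-truth parameters."""
    pred = np.asarray(pred_params, dtype=float)
    true = np.asarray(true_params, dtype=float)
    return float(np.mean((pred - true) ** 2))


def spectral_convergence(pred_spec, target_spec):
    """Assumed SC definition: ||target - pred||_F / ||target||_F."""
    numerator = np.linalg.norm(target_spec - pred_spec)  # Frobenius norm for 2-D input
    denominator = np.linalg.norm(target_spec)
    return float(numerator / denominator)


if __name__ == "__main__":
    sample_rate = 16000  # hypothetical sample rate for the toy signals
    t = np.arange(sample_rate) / sample_rate
    target = np.sin(2 * np.pi * 440.0 * t)
    close_match = np.sin(2 * np.pi * 445.0 * t)   # slightly detuned
    poor_match = np.sin(2 * np.pi * 1760.0 * t)   # two octaves away

    target_spec = magnitude_spectrogram(target)
    print("SC (close):", spectral_convergence(magnitude_spectrogram(close_match), target_spec))
    print("SC (poor): ", spectral_convergence(magnitude_spectrogram(poor_match), target_spec))

    # 16 parameters, assumed here to be normalized to [0, 1].
    rng = np.random.default_rng(0)
    true_params = rng.uniform(size=16)
    pred_params = np.clip(true_params + rng.normal(scale=0.05, size=16), 0.0, 1.0)
    print("Parameter MSE:", parameter_mse(pred_params, true_params))
```

Lower values are better for both quantities. The parameter MSE compares predictions in parameter space, while SC compares the resulting audio, so the two can disagree when different parameter settings produce similar sounds.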

Abstract

Systems for synthesizer sound matching, which automatically set the parameters of a synthesizer to emulate an input sound, have the potential to make the process of synthesizer programming faster and easier for novice and experienced musicians alike, whilst also affording new means of interaction with synthesizers. Considering the enormous variety of synthesizers in the marketplace, and the complexity of many of them, general-purpose sound matching systems that function with minimal knowledge or prior assumptions about the underlying synthesis architecture are particularly desirable. With this in mind, we introduce a synthesizer sound matching model based on the Audio Spectrogram Transformer. We demonstrate the viability of this model by training on a large synthetic dataset of randomly generated samples from the popular Massive synthesizer. We show that this model can reconstruct parameters of samples generated from a set of 16 parameters, highlighting its improved fidelity relative to multi-layer perceptron and convolutional neural network baselines. We also provide audio examples demonstrating the out-of-domain model performance in emulating vocal imitations, and sounds from other synthesizers and musical instruments.

Paper Structure (11 sections, 1 figure, 2 tables)

Introduction and Related Work
Proposed Method
The Audio Spectrogram Transformer
AST for Synthesizer Sound Matching
Experimental Results
Dataset
Evaluation Method
Baselines
Results
Conclusion
Acknowledgments

Figures (1)

Figure 1: Proposed synthesizer sound matching system block diagram.
