SynthScribe: Deep Multimodal Tools for Synthesizer Sound Retrieval and Exploration

Stephen Brade; Bryan Wang; Mauricio Sousa; Gregory Lee Newsome; Sageev Oore; Tovi Grossman

TL;DR

SynthScribe is a full-stack multimodal system for synthesizer sound retrieval, modification, and creation, leveraging LAION-CLAP to connect text and audio with a Diva-based preset bank. By combining multimodal search, a user-centered genetic mixing algorithm, and a text- and audio-driven preset modification interface, the approach enables high-level control over timbres without retraining models for each synthesizer. Two user studies demonstrate reliable sound retrieval and meaningful modifications, while free-usage observations show time savings and the discovery of surprising, original sounds. The work highlights a practical pathway to democratize sound design, enabling both novices and professionals to explore and invent timbres efficiently, with future work focusing on personalization, cross-synth support, and timbre-language customization.

Abstract

Synthesizers are powerful tools that allow musicians to create dynamic and original sounds. Existing commercial interfaces for synthesizers typically require musicians to interact with complex low-level parameters or to manage large libraries of premade sounds. To address these challenges, we implement SynthScribe, a full-stack system that uses multimodal deep learning to let users express their intentions at a much higher level. We implement features that address a number of difficulties, namely 1) searching through existing sounds, 2) creating completely new sounds, and 3) making meaningful modifications to a given sound. This is achieved with three main features: a multimodal search engine for a large library of synthesizer sounds; a user-centered genetic algorithm by which completely new sounds can be created and selected given the user's preferences; and a sound-editing support feature that highlights and gives examples for key control parameters with respect to a text- or audio-based query. The results of our user studies show that SynthScribe is capable of reliably retrieving and modifying sounds while also affording the ability to create completely new sounds that expand a musician's creative horizon.
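
As a rough illustration of how a multimodal search over a preset library might rank sounds against a text query, the sketch below scores presets by cosine similarity in a shared text-audio embedding space. It is a minimal sketch under stated assumptions, not the authors' implementation: the preset names, the two-dimensional toy vectors, and the function names `cosine_similarity` and `rank_presets` are hypothetical, and the embeddings are hard-coded rather than produced by a real model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_presets(query_embedding, preset_embeddings, top_k=3):
    """Return the top_k (name, score) pairs, most similar first.

    query_embedding   -- embedding of the text query (assumed precomputed)
    preset_embeddings -- dict of preset name -> audio embedding (assumed precomputed)
    """
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in preset_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy, hand-written embeddings (hypothetical; a real system would compute these).
presets = {
    "warm_pad": [1.0, 0.0],
    "bright_lead": [0.0, 1.0],
    "soft_keys": [0.7, 0.7],
}
query = [1.0, 0.1]  # stands in for the embedding of a text query

for name, score in rank_presets(query, presets):
    print(f"{name}: {score:.3f}")
```

Because text and audio share one embedding space in this setup, the same ranking function would serve an audio query as well: only the source of `query_embedding` changes.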

Paper Structure (46 sections, 8 figures, 2 tables)

This paper contains 46 sections, 8 figures, 2 tables.

Introduction
Related Work
Multimodal Deep Learning
Assisted Sound Synthesis
HCI and Parameter Exploration Problems
Formative Interviews
The Trouble of Finding the Right Sound
The Trouble of Modifying Sounds
The Importance of Originality
Design Goals
SynthScribe
Multimodal Search
Genetic Mixing
Preset Modification
Supporting Features
...and 31 more sections

Figures (8)

Figure 1: The workflow that SynthScribe implements. Starting in the top left (orange), a user can use text to retrieve relevant synthesizer sounds from a default preset bank. To create new sounds, they can generate an entirely new preset bank by using Genetic Mixing (turquoise) to create unique combinations of their favourite sounds from the default preset bank. Given a sound that is close to what they want, they can send it to the Preset Modification feature (green). Users can then describe the desired modification in text and are provided with examples of how they can change the parameters to achieve a better sound.
Figure 2: Interface for the window where users can search through the default preset bank (visualized in this figure) or presets generated with the Genetic Mixing feature. Key functionality for Multimodal Search is highlighted in orange, and Genetic Mixing features are highlighted in turquoise.
Figure 3: Interface for the Preset Modification feature. Users can retrieve examples that are relevant to a desired effect by first entering a text query in the same recommended format as the multimodal search. Examples relating to this text query are placed in columns of the "Examples" matrix. These examples are labelled numerically and can be refined using an audio search. The "old" column contains the sound the user wishes to modify. They can change this sound by clicking on cells in other columns. Each row corresponds to 10 possible changes for a group of parameters, and the importance of these groups is highlighted using the green LEDs to the left of the parameter group names.
Figure 4: On the left is the Synthesizer parameter display which lets users access low-level synthesis parameters and the default Diva synthesizer interface. On the right is an Oscilloscope which provides a visualization for the shape of the waveform being emitted from the synthesizer.
Figure 5: Demonstration of a uniform crossover between two parent synthesizer sounds. The groups of parameters (Oscillators, Filters, ..., Effects) are swapped in their entirety to create the children.
...and 3 more figures
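
The group-wise uniform crossover described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dictionary layout for presets, the parameter names (`osc1_shape`, `cutoff`, `reverb_mix`), the function name `uniform_group_crossover`, and the 50% swap probability are all assumptions made for the example. Only the three parameter groups the caption names are used; the groups it elides are not guessed.

```python
import random


def uniform_group_crossover(parent_a, parent_b, rng=None, swap_prob=0.5):
    """Create two children by swapping whole parameter groups between parents.

    Each parent maps a group name to a dict of parameter values. For every
    group, the children either keep their own parent's group or exchange it
    in its entirety; individual parameters inside a group are never mixed.
    """
    if parent_a.keys() != parent_b.keys():
        raise ValueError("parents must define the same parameter groups")
    rng = rng or random.Random()
    child_a, child_b = {}, {}
    for group in parent_a:
        if rng.random() < swap_prob:
            # Swap: each child takes this group from the other parent.
            child_a[group] = dict(parent_b[group])
            child_b[group] = dict(parent_a[group])
        else:
            # Keep: each child inherits this group from its own parent.
            child_a[group] = dict(parent_a[group])
            child_b[group] = dict(parent_b[group])
    return child_a, child_b


# Hypothetical presets; parameter names and values are for illustration only.
parent_a = {
    "Oscillators": {"osc1_shape": 0.1},
    "Filters": {"cutoff": 0.2},
    "Effects": {"reverb_mix": 0.3},
}
parent_b = {
    "Oscillators": {"osc1_shape": 0.9},
    "Filters": {"cutoff": 0.8},
    "Effects": {"reverb_mix": 0.7},
}

child_a, child_b = uniform_group_crossover(parent_a, parent_b, rng=random.Random(0))
print(child_a)
print(child_b)
```

Swapping at the group level rather than per parameter keeps related settings together, so a child inherits, for example, a complete filter configuration from one parent instead of a blend of unrelated values.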
