Creative Text-to-Audio Generation via Synthesizer Programming

Manuel Cherep; Nikhil Singh; Jessica Shand

TL;DR

This paper proposes CTAG, a text-to-audio generation method that replaces opaque neural latent spaces with a 78-parameter virtual modular synthesizer to produce abstract, sketch-like sounds aligned with text prompts. CTAG optimizes the synthesizer parameters with gradient-free Evolution Strategies against a LAION-CLAP semantic objective, yielding outputs that are identifiable yet artistically interpretive and that can be directly inspected, interpolated, and manipulated. The authors evaluate the method with classification experiments, synthesis-quality descriptors, and a user study, and report that CTAG can outperform certain baselines in abstraction and artistic appeal, with results sustained across different prompt strategies and sound durations. The work positions abstraction-focused, interpretable synthesis as a complementary path to neural audio generation that supports creative exploration and practical tooling for sound designers. The authors propose an open-source release so that novices and experts can explore and extend abstract text-to-audio paradigms.

Abstract

Neural audio synthesis methods now allow specifying ideas in natural language. However, these methods produce results that cannot be easily tweaked, as they are based on large latent spaces and up to billions of uninterpretable parameters. We propose a text-to-audio generation method that leverages a virtual modular sound synthesizer with only 78 parameters. Synthesizers have long been used by skilled sound designers for media like music and film due to their flexibility and intuitive controls. Our method, CTAG, iteratively updates a synthesizer's parameters to produce high-quality audio renderings of text prompts that can be easily inspected and tweaked. Sounds produced this way are also more abstract, capturing essential conceptual features over fine-grained acoustic details, akin to how simple sketches can vividly convey visual concepts. Our results show how CTAG produces sounds that are distinctive, perceived as artistic, and yet similarly identifiable to recent neural audio synthesis models, positioning it as a valuable and complementary tool.
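
The iterative parameter update described above can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a hypothetical 3-parameter synthesizer (`toy_synth`) in place of the 78-parameter one, a spectral cosine similarity (`toy_score`) in place of the LAION-CLAP text-audio similarity, and a simple elite-averaging evolution strategy rather than the specific algorithm the paper selects. All function names and hyperparameters are illustrative.

```python
import numpy as np

def toy_synth(params, sr=8000, duration=0.25):
    """Hypothetical 3-parameter stand-in for the real synthesizer.
    params are expected in [0, 1]."""
    freq = 100.0 + 900.0 * params[0]   # oscillator frequency in Hz
    amp = params[1]                    # output gain
    decay = 1.0 + 20.0 * params[2]     # exponential decay rate
    t = np.arange(int(sr * duration)) / sr
    return amp * np.exp(-decay * t) * np.sin(2 * np.pi * freq * t)

def toy_score(audio, target):
    """Placeholder objective (assumption): cosine similarity of magnitude
    spectra. The paper instead scores audio against a text prompt."""
    a = np.abs(np.fft.rfft(audio))
    b = np.abs(np.fft.rfft(target))
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def evolve(score_fn, dim, generations=60, pop_size=32, elite=8, sigma=0.2, seed=0):
    """Gradient-free search: sample candidates around a mean, keep the
    best-scoring ones, and move the mean to their average."""
    rng = np.random.default_rng(seed)
    mean = np.full(dim, 0.5)
    best_params, best_score = mean.copy(), score_fn(mean)
    for _ in range(generations):
        noise = rng.standard_normal((pop_size, dim))
        pop = np.clip(mean + sigma * noise, 0.0, 1.0)
        scores = np.array([score_fn(p) for p in pop])
        order = np.argsort(scores)[::-1]           # highest score first
        mean = pop[order[:elite]].mean(axis=0)     # recombine the elite
        if scores[order[0]] > best_score:
            best_params, best_score = pop[order[0]].copy(), float(scores[order[0]])
        sigma *= 0.97                              # slowly narrow the search
    return best_params, best_score

# Usage: recover parameters whose rendering resembles a reference sound.
target = toy_synth(np.array([0.8, 0.6, 0.3]))
start_score = toy_score(toy_synth(np.full(3, 0.5)), target)
best_params, best_score = evolve(lambda p: toy_score(toy_synth(p), target), dim=3)
```

Because the search only needs scores for rendered candidates, the objective does not have to be differentiable with respect to the synthesizer parameters, and the result is a short, readable parameter vector rather than a latent code.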

Paper Structure (39 sections, 2 equations, 6 figures, 9 tables, 1 algorithm)

This paper contains 39 sections, 2 equations, 6 figures, 9 tables, 1 algorithm.

Introduction
Related Work
Sound Synthesis
Language-Sound Correspondence
Abstract Synthesis
Interpretable and Controllable Synthesis
The Synthesizer Programming Problem
Methods
Synthesizer
Optimization
Objective Function
Evaluation Metrics
Classification Experiments
Synthesis Quality
User Study
...and 24 more sections

Figures (6)

Figure 1: CTAG leverages a virtual modular synthesizer to generate sounds capturing the semantics of user-provided text prompts in a sketch-like way, rather than being acoustically literal. Spectrograms of auditory outputs corresponding to six text prompts showcase the range of sounds this approach can yield, accompanied by a fully interpretable and controllable parameter space.
Figure 2: High-level overview: we use the LAION-CLAP model [wu2023large] to compute the similarity between a user-provided text prompt and the output of SynthAX [cherep2023synthax]. The optimization procedure iteratively adjusts the parameter settings.
Figure 3: Results from our ablation study; all experiments are conducted with ESC-50. (Top) CLAP maximization curves, averaged across 10 iterations for each of the 50 prompts. Colored bands show 95% confidence intervals. (Bottom) Classification accuracy, with error bars showing 95% confidence intervals. Top and bottom plots share colors. (Left) Performance of different algorithms, with hyperparameters tuned on ESC-10. LES is strongest in both optimization and downstream classification. (Center) Different sound durations; we find 2 seconds to be strongest. (Right) Impact of synthesizer architecture; the Voice model yields the strongest results. Parameter counts are given in parentheses, such as (78) for Voice.
Figure 4: Spectrogram series as the result of linear interpolation of the synthesizer's parameters (1) from "Spray" (left) to "Machine gun" (right), (2) from "Train horn" to "Chainsaw", and (3) from "Train wagon" to "Engine revving". Each spectrogram in the sequence represents a step in the interpolation, highlighting the systematic shift in acoustic properties.
Figure 5: User study classification accuracy per prompt, for CTAG, AudioGen, and AudioLDM.
...and 1 more figure
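
The parameter interpolation shown in Figure 4 can be sketched as follows. This is a minimal illustration, assuming each sound is a flat vector of 78 synthesizer parameters. The helper name `interpolate_params` and the random example vectors are hypothetical, and rendering each step to audio is left out.

```python
import numpy as np

def interpolate_params(start, end, steps):
    """Return `steps` parameter vectors evenly spaced on the straight line
    from `start` to `end` (both endpoints included)."""
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    weights = np.linspace(0.0, 1.0, steps)[:, None]   # shape (steps, 1)
    return (1.0 - weights) * start + weights * end    # shape (steps, n_params)

# Usage with placeholder vectors standing in for two optimized sounds.
rng = np.random.default_rng(0)
params_a = rng.uniform(size=78)
params_b = rng.uniform(size=78)
path = interpolate_params(params_a, params_b, steps=5)
```

Each row of `path` is a complete parameter setting, so every intermediate step can be rendered and inspected like any other sound.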
