Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations

Hugo Flores García, Oriol Nieto, Justin Salamon, Bryan Pardo, Prem Seetharaman

TL;DR

Sketch2Sound introduces a lightweight method to steer a pre-trained text-to-audio latent diffusion transformer with time-varying controls—loudness, brightness, and pitch—plus vocal imitations. By adding a single linear projection per control to the diffusion latents and using dropout during fine-tuning, the model learns to follow gesture-like sonic controls while preserving text adherence. A novel median-filtering strategy creates sketchlike control signals, enabling users to trade temporal specificity for audio quality and textual alignment at inference time. Evaluations on VimSketch demonstrate strong fidelity to vocal imitations, robust text adherence, and competitive audio quality with a flexible, gesture-driven interface for sound designers.
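
The conditioning idea summarized above (one linear projection per control, added to the latents, with controls randomly dropped during fine-tuning) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `add_controls`, the one-scalar-per-frame control shape, the bias term, and the choice to let a dropped control contribute nothing are all assumptions made for the example.

```python
import numpy as np

def add_controls(latents, controls, weights, biases, drop_prob=0.0, rng=None):
    """Add linearly projected control signals to diffusion latents.

    latents:   (T, D) array, T latent frames with D channels each.
    controls:  dict name -> (T,) array, one value per latent frame
               (assumed already resampled to the latent frame rate).
    weights:   dict name -> (D,) array, the per-control projection weight.
    biases:    dict name -> (D,) array, the per-control projection bias.
    drop_prob: probability of dropping each control independently
               (hypothetical parameter name; used only during fine-tuning).
    """
    rng = np.random.default_rng() if rng is None else rng
    out = latents.astype(float)  # astype returns a copy, so the input is untouched
    for name, signal in controls.items():
        if rng.random() < drop_prob:
            continue  # assumption: a dropped control adds nothing to the latents
        # Linear layer from 1 input feature to D channels, applied per frame.
        out += signal[:, None] * weights[name][None, :] + biases[name][None, :]
    return out

# Toy usage: 4 latent frames, 3 channels, a single "loudness" control.
latents = np.zeros((4, 3))
controls = {"loudness": np.array([1.0, 2.0, 3.0, 4.0])}
weights = {"loudness": np.ones(3)}
biases = {"loudness": np.zeros(3)}
conditioned = add_controls(latents, controls, weights, biases)
```

In a real model the weights would be learned during fine-tuning; here they are fixed so the effect of each control on the latents is easy to inspect.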

Abstract

We present Sketch2Sound, a generative audio model capable of creating high-quality sounds from a set of interpretable time-varying control signals: loudness, brightness, and pitch, as well as text prompts. Sketch2Sound can synthesize arbitrary sounds from sonic imitations (i.e., a vocal imitation or a reference sound-shape). Sketch2Sound can be implemented on top of any text-to-audio latent diffusion transformer (DiT), and requires only 40k steps of fine-tuning and a single linear layer per control, making it more lightweight than existing methods like ControlNet. To synthesize from sketchlike sonic imitations, we propose applying random median filters to the control signals during training, allowing Sketch2Sound to be prompted using controls with flexible levels of temporal specificity. We show that Sketch2Sound can synthesize sounds that follow the gist of input controls from a vocal imitation while retaining the adherence to an input text prompt and audio quality compared to a text-only baseline. Sketch2Sound allows sound artists to create sounds with the semantic flexibility of text prompts and the expressivity and precision of a sonic gesture or vocal imitation. Sound examples are available at https://hugofloresgarcia.art/sketch2sound/.
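
To make the median-filtering step concrete, the following is a minimal sketch of applying a randomly sized median filter to a one-dimensional control signal. It is not the authors' code: the helper names, the edge-padding choice, and the `max_kernel` value are assumptions for illustration, and the abstract does not state the range of filter sizes actually used.

```python
import numpy as np

def median_filter_1d(signal, kernel_size):
    """Median-filter a 1-D signal with an odd window, keeping its length."""
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    pad = kernel_size // 2
    # Repeat the edge values so the output has as many frames as the input.
    padded = np.pad(signal, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel_size)
    return np.median(windows, axis=-1)

def random_median_filter(signal, max_kernel=9, rng=None):
    """Filter with a window size drawn uniformly from the odd sizes up to max_kernel.

    max_kernel is a hypothetical value chosen for this example.
    """
    rng = np.random.default_rng() if rng is None else rng
    odd_sizes = np.arange(1, max_kernel + 1, 2)
    kernel_size = int(rng.choice(odd_sizes))
    return median_filter_1d(signal, kernel_size), kernel_size

# Toy usage: a loudness curve with a one-frame spike and a sustained step.
loudness = np.array([0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
smoothed = median_filter_1d(loudness, 3)   # removes the spike, keeps the step
sketchy, used_size = random_median_filter(loudness, rng=np.random.default_rng(0))
```

A window of size 1 leaves the control untouched, while larger windows discard short-lived detail and keep only the overall shape. Drawing the size at random during training is what would let the same model accept controls at different levels of temporal specificity at inference.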

Paper Structure

This paper contains 11 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1 (Fig. 2 in the paper): At inference, larger median filters are more sketchlike and can lead to higher audio quality, while smaller filters are more precise and may lead to lower audio quality if the vocal imitations aren't precise enough, giving the sound artist a choice over this trade-off.