SILA: Signal-to-Language Augmentation for Enhanced Control in Text-to-Audio Generation
Sonal Kumar, Prem Seetharaman, Justin Salamon, Dinesh Manocha, Oriol Nieto
TL;DR
SILA tackles fine-grained control in text-to-audio (TTA) generation by augmenting prompts with learned acoustic descriptors. A caption-generation pipeline produces descriptor-rich captions, and DSP-inspired descriptors are appended to guide generation; the approach is demonstrated with a diffusion-transformer-based TTA model conditioned on language representations. It improves alignment between text and audio (CLAP_scr) while maintaining plausible audio quality (FAD), and it achieves strong subjective user satisfaction, enabling more precise manipulation of loudness, pitch, reverb, brightness, fade, noise, and duration. The work advances practical sound design by enabling precise acoustic control within flexible, model-agnostic generation frameworks, with potential for broader adoption in professional audio workflows.
Abstract
The field of text-to-audio generation has seen significant advancements, yet the ability to finely control the acoustic characteristics of generated audio remains under-explored. In this paper, we introduce a novel yet simple approach to generate sound effects with control over key acoustic parameters such as loudness, pitch, reverb, fade, brightness, noise, and duration, enabling creative applications in sound design and content creation. These parameters extend beyond traditional Digital Signal Processing (DSP) techniques, incorporating learned representations that capture the subtleties of how sound characteristics can be shaped in context, which enables richer and more nuanced control over the generated audio. The approach is model-agnostic and is based on learning the disentanglement between audio semantics and acoustic features. It enhances the versatility and expressiveness of text-to-audio generation and opens new avenues for creative audio production and sound design. Our objective and subjective evaluation results demonstrate the effectiveness of the approach in producing high-quality, customizable audio outputs that align closely with user specifications.
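To make the idea of augmenting a caption with signal-derived descriptors concrete, the following is a minimal sketch and not the authors' pipeline. It measures two simple DSP quantities (RMS level and spectral centroid) and appends coarse labels to a caption. The function names, the threshold values (-20 dBFS, 2000 Hz), the label vocabulary, and the output caption format are all assumptions made for illustration.

```python
import numpy as np


def rms_dbfs(signal):
    """RMS level in dB relative to full scale (amplitude 1.0)."""
    rms = np.sqrt(np.mean(np.square(signal)))
    return 20.0 * np.log10(max(rms, 1e-12))  # floor avoids log(0) on silence


def spectral_centroid_hz(signal, sample_rate):
    """Magnitude-weighted mean frequency over the whole clip."""
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(freqs * magnitudes) / max(np.sum(magnitudes), 1e-12))


def augment_caption(caption, signal, sample_rate):
    """Append coarse acoustic descriptors to a semantic caption.

    The thresholds and labels below are hypothetical, chosen only to
    illustrate the mapping from measurements to words.
    """
    loudness = "loud" if rms_dbfs(signal) > -20.0 else "soft"
    centroid = spectral_centroid_hz(signal, sample_rate)
    brightness = "bright" if centroid > 2000.0 else "dull"
    duration = len(signal) / sample_rate
    return (
        f"{caption.rstrip('.')}. loudness: {loudness}, "
        f"brightness: {brightness}, duration: {duration:.1f} seconds"
    )


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr  # one second of audio
    tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)
    print(augment_caption("A dog barking", tone, sr))
```

A real system would likely use perceptual measures (for example, a loudness standard rather than raw RMS) and finer-grained labels; the sketch only shows how measured signal properties can be turned into text that a language-conditioned generator can attend to.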
