Generation of Musical Timbres using a Text-Guided Diffusion Model
Weixuan Yuan, Qadeer Khan, Vladimir Golkov
TL;DR
This work introduces a text-guided, diffusion-based framework for generating and manipulating musical timbres at the single-note level. By operating in a latent spectral space learned via VQ-GAN and aligning text descriptions with timbre representations through contrastive learning, the method jointly models magnitude and phase to produce realistic timbres conditioned on natural language prompts. It enables both note-level timbre synthesis and targeted timbre editing through RePaint-inspired spectral inpainting and global transformations, offering a musician-centric workflow that preserves creative input while expanding the range of available timbres. Quantitative and qualitative evaluations on NSynth demonstrate improvements in realism, diversity, and text-timbre alignment, with practical implications for composers and performers seeking flexible, previously nonexistent timbres for electronic instruments and DAWs.
Abstract
In recent years, text-to-audio systems have achieved remarkable success, enabling the generation of complete audio segments directly from text descriptions. While these systems also facilitate music creation, the element of human creativity and deliberate expression is often limited. In contrast, the present work allows composers, arrangers, and performers to create the basic building blocks for music creation: audio of individual musical notes for use in electronic instruments and digital audio workstations (DAWs). Through text prompts, the user can specify the timbre characteristics of the audio. We introduce a system that combines a latent diffusion model and multi-modal contrastive learning to generate musical timbres conditioned on text descriptions. By jointly generating the magnitude and phase of the spectrogram, our method eliminates the need to run a separate phase retrieval algorithm afterwards, a step that related methods require. Audio examples, source code, and a web app are available at https://wxuanyuan.github.io/Musical-Note-Generation/
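
To illustrate why generating phase alongside magnitude removes the need for phase retrieval, the following minimal sketch works on a single toy spectral frame. It is not the authors' implementation: the function names, the 32-sample frame, and the use of a plain DFT in place of a full spectrogram pipeline are all assumptions made for illustration. When both magnitude and phase are known, the inverse transform recovers the waveform directly. When only the magnitude is known, the phase has to be estimated by a separate algorithm, and a naive guess (here, zero phase) yields a different signal.

```python
import cmath
import math


def dft(frame):
    """Discrete Fourier transform of one real-valued frame (naive O(n^2) version)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT; returns the real part, which is the waveform for a real signal."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split_mag_phase(spectrum):
    """Split complex bins into the two quantities a model could generate jointly."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def combine_mag_phase(mag, phase):
    """Rebuild complex bins from magnitude and phase."""
    return [cmath.rect(m, p) for m, p in zip(mag, phase)]


# Hypothetical toy frame: two sinusoidal partials, 32 samples.
n_samples = 32
frame = [
    math.sin(2 * math.pi * 3 * t / n_samples)
    + 0.5 * math.sin(2 * math.pi * 7 * t / n_samples + 0.4)
    for t in range(n_samples)
]

mag, phase = split_mag_phase(dft(frame))

# Case 1: magnitude and phase are both available, so inversion is direct.
rebuilt = idft(combine_mag_phase(mag, phase))
max_err = max(abs(a - b) for a, b in zip(frame, rebuilt))

# Case 2: only magnitude is available; a zero-phase guess gives a different waveform.
zero_phase = idft(combine_mag_phase(mag, [0.0] * len(mag)))
zero_err = max(abs(a - b) for a, b in zip(frame, zero_phase))

print("max error with magnitude and phase:", max_err)
print("max error with magnitude only (zero phase):", zero_err)
```

In a full system the same idea would apply frame by frame through an inverse short-time Fourier transform; the sketch keeps to one frame so that it stays dependency-free.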
