SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer
Helin Wang, Jiarui Hai, Yen-Ju Lu, Karan Thakkar, Mounya Elhilali, Najim Dehak
TL;DR
SoloAudio introduces a latent-diffusion Transformer for target sound extraction that operates in a VAE latent space and is conditioned on CLAP-derived features of the target sound. Using long skip connections, rotary position embeddings, and classifier-free guidance, it achieves state-of-the-art results on both in-domain (FSD-Mix) and out-of-domain (AudioSet) data. Training also draws on synthetic data generated by text-to-audio models, which enables strong zero-shot and few-shot generalization to unseen sounds. Results on synthetic and real datasets, together with subjective evaluations, show robust target isolation and better perceptual quality than prior approaches.
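As a rough illustration of the classifier-free guidance mentioned above, the sketch below shows the standard way a conditional and an unconditional model prediction are combined at sampling time. It is not taken from the SoloAudio code: the function name `apply_cfg`, the argument names, and the use of plain Python lists in place of latent tensors are all assumptions made for this example.

```python
from typing import List


def apply_cfg(
    cond_pred: List[float],
    uncond_pred: List[float],
    guidance_scale: float,
) -> List[float]:
    """Combine two model predictions with classifier-free guidance.

    Hypothetical helper for illustration only. `cond_pred` stands for the
    model output given the target-sound condition, `uncond_pred` for the
    output with the condition dropped.
    """
    if len(cond_pred) != len(uncond_pred):
        raise ValueError("predictions must have the same length")
    # Standard CFG rule: start from the unconditional prediction and move
    # along the (conditional - unconditional) direction by guidance_scale.
    return [u + guidance_scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]


if __name__ == "__main__":
    cond = [1.0, 3.0]
    uncond = [0.0, 1.0]
    print(apply_cfg(cond, uncond, guidance_scale=2.0))
```

With a scale of 1 the rule returns the conditional prediction unchanged, and with a scale of 0 it returns the unconditional one; larger values push the sample further toward the condition.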
Abstract
In this paper, we introduce SoloAudio, a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the previous U-Net backbone with a skip-connected Transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by using a CLAP model as the feature extractor for target sounds. Furthermore, SoloAudio leverages synthetic audio generated by state-of-the-art text-to-audio models for training, demonstrating strong generalization to out-of-domain data and unseen sound events. We evaluate this approach on the FSD Kaggle 2018 mixture dataset and on real data from AudioSet, where SoloAudio achieves state-of-the-art results on both in-domain and out-of-domain data and exhibits impressive zero-shot and few-shot capabilities. Source code and demos are released.
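The abstract does not spell out how the skip connections in the Transformer are wired. One common arrangement in skip-connected diffusion Transformers pairs each block in the first half of the stack with its mirror block in the second half, much like the encoder-decoder links of a U-Net. The sketch below only computes that pairing; the symmetric layout and the helper name `long_skip_pairs` are assumptions for illustration, not a description of the released model.

```python
from typing import List, Tuple


def long_skip_pairs(depth: int) -> List[Tuple[int, int]]:
    """Return (source_block, target_block) index pairs for long skips.

    Hypothetical helper: assumes a symmetric layout in which block i in
    the first half feeds block depth - 1 - i in the second half. With an
    odd depth the middle block is left unpaired.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    half = depth // 2
    return [(i, depth - 1 - i) for i in range(half)]


if __name__ == "__main__":
    # Print the pairing for a 6-block stack.
    print(long_skip_pairs(6))
```

In such a layout, the output of each source block would typically be concatenated with (or added to) the input of its target block, so that shallow latent features reach the deeper layers directly.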
