PromptSep: Generative Audio Separation via Multimodal Prompting

Yutong Wen, Ke Chen, Prem Seetharaman, Oriol Nieto, Jiaqi Su, Rithesh Kumar, Minje Kim, Paris Smaragdis, Zeyu Jin, Justin Salamon

TL;DR

PromptSep addresses the limitations of language-only conditioning in audio source separation by enabling both extraction and removal through a diffusion-based framework. It introduces vocal imitation as an intuitive conditioning modality and leverages Sketch2Sound-based data augmentation to generate temporally aligned conditioning samples, enabling open-vocabulary target control. The approach combines text and imitation cues within a latent diffusion model (DiT) using VAE and FLAN-T5 components, trained with a diverse data pipeline, and evaluated across multiple benchmarks. Empirical results show state-of-the-art removal performance and strong imitation-guided extraction, while maintaining competitive results for language-based separation, with robust subjective quality across tasks. This work broadens practical audio separation capabilities and offers a scalable pathway for multimodal user control in real-world applications.

Abstract

Recent breakthroughs in language-queried audio source separation (LASS) have shown that generative models can produce separated audio of higher quality than traditional masking-based approaches. However, two key limitations restrict their practical use: (1) users often require operations beyond separation, such as sound removal; and (2) relying solely on text prompts can be unintuitive for specifying sound sources. In this paper, we propose PromptSep to extend LASS into a broader framework for general-purpose sound separation. PromptSep leverages a conditional diffusion model, supported by an elaborate data simulation pipeline, to enable both audio extraction and sound removal. To move beyond text-only queries, we add vocal imitation as a second, more intuitive conditioning modality, using Sketch2Sound as a data augmentation strategy. Both objective and subjective evaluations on multiple benchmarks demonstrate that PromptSep achieves state-of-the-art performance in sound removal and vocal-imitation-guided source separation, while maintaining competitive results on language-queried source separation.
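
To make the distinction between extraction and removal concrete, the following is a minimal sketch of how one training pair could be simulated. It is not the authors' pipeline: the function name `simulate_example`, the `mode` argument, and the assumption that a mixture is a plain sample-wise sum of its sources are all hypothetical choices made for illustration.

```python
def simulate_example(sources, target_index, mode):
    """Build one (mixture, target) pair from equal-length waveforms.

    Hypothetical helper for illustration only. Assumes the mixture is the
    sample-wise sum of all sources, which the abstract does not specify.
    """
    if mode not in ("extract", "remove"):
        raise ValueError("mode must be 'extract' or 'remove'")
    length = len(sources[0])
    if any(len(s) != length for s in sources):
        raise ValueError("all sources must have the same length")

    # The model input: every source summed together.
    mixture = [sum(samples) for samples in zip(*sources)]

    if mode == "extract":
        # Extraction: the desired output is the prompted source alone.
        target = list(sources[target_index])
    else:
        # Removal: the desired output is everything except the prompted source.
        target = [
            sum(s[i] for j, s in enumerate(sources) if j != target_index)
            for i in range(length)
        ]
    return mixture, target


# Three toy "waveforms" of two samples each.
sources = [[1.0, 2.0], [0.5, 0.5], [0.25, -1.0]]
mixture, kept = simulate_example(sources, target_index=1, mode="extract")
_, rest = simulate_example(sources, target_index=1, mode="remove")
```

Under this assumption the extraction and removal targets for the same prompt sum back to the mixture, which is one way to see the two tasks as complementary.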

Paper Structure

This paper contains 15 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: The model architecture of PromptSep. Text and vocal imitation inputs can be used separately or combined.