Neural Audio Codecs for Prompt-Driven Universal Sound Separation
Adhiraj Banerjee, Vipul Arora
TL;DR
CodecSep addresses the need for edge-friendly, open-domain sound separation by performing prompt-guided masking directly on neural audio codec latents. The method freezes a DAC backbone and applies a FiLM-conditioned Transformer masker, guided by CLAP text embeddings, that selectively passes codec latents rather than generating new content. It achieves higher SI-SDR than spectrogram-based baselines and remains competitive in perceptual quality while substantially reducing compute, memory, and latency in code-stream deployments. The approach generalizes well across open-domain datasets and can be extended to prompts from other modalities, making on-device universal sound separation more practical at scale.
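The prompt-guided masking idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes a plain list-of-lists stand-in for DAC latents, a random vector in place of a CLAP text embedding, and hypothetical names (`linear`, `film_mask`, `w_gamma`, `w_beta`). The Transformer layers are omitted so that only the FiLM modulation and the masking step are visible.

```python
import math
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one vector x (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def film_mask(latents, text_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Scale each latent by a prompt-conditioned mask in (0, 1).

    latents:  T frames, each a list of D floats (stand-in for codec latents)
    text_emb: list of E floats (stand-in for a CLAP text embedding)
    """
    gamma = linear(text_emb, w_gamma, b_gamma)  # per-channel FiLM scale
    beta = linear(text_emb, w_beta, b_beta)     # per-channel FiLM shift
    masked = []
    for frame in latents:
        # FiLM: feature-wise affine modulation, followed by a sigmoid to get a mask.
        # A real masker would run Transformer layers here; this sketch skips them.
        mask = [1.0 / (1.0 + math.exp(-(g * z + b)))
                for g, z, b in zip(gamma, frame, beta)]
        # Masking only rescales existing latents; no new content is generated.
        masked.append([m * z for m, z in zip(mask, frame)])
    return masked

# Toy dimensions (assumed): T frames, D latent channels, E embedding size.
random.seed(0)
T, D, E = 4, 3, 5
latents = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
text_emb = [random.uniform(-1, 1) for _ in range(E)]
w_gamma = [[random.uniform(-1, 1) for _ in range(E)] for _ in range(D)]
w_beta = [[random.uniform(-1, 1) for _ in range(E)] for _ in range(D)]
b_gamma = [0.0] * D
b_beta = [0.0] * D

masked = film_mask(latents, text_emb, w_gamma, b_gamma, w_beta, b_beta)
```

Because the mask lies strictly between 0 and 1, every output latent is a scaled-down copy of an input latent, which is the sense in which a masker passes content selectively instead of synthesizing it.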
Abstract
Text-guided sound separation supports flexible audio editing across media and assistive applications, but existing models like AudioSep are too compute-heavy for edge deployment. Neural audio codec (NAC) models such as CodecFormer and SDCodec are compute-efficient but limited to fixed-class separation. We introduce CodecSep, the first NAC-based model for on-device universal, text-driven separation. CodecSep combines DAC compression with a Transformer masker modulated by CLAP-derived FiLM parameters. Across six open-domain benchmarks under matched training/prompt protocols, CodecSep surpasses AudioSep in separation fidelity (SI-SDR) while remaining competitive in perceptual quality (ViSQOL) and matching or exceeding fixed-stem baselines (TDANet, CodecFormer, SDCodec). In code-stream deployments, it needs just 1.35 GMACs end-to-end, approximately 54× less compute (25× architecture-only) than spectrogram-domain separators like AudioSep, while remaining fully bitstream-compatible.
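For readers unfamiliar with the fidelity metric named above, the following is a minimal sketch of the standard SI-SDR definition in plain Python. It is not the paper's evaluation code; the function name `si_sdr`, the `eps` stabilizer, and the toy signals are assumptions made for illustration.

```python
import math

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for two equal-length signals."""
    n = len(reference)
    # Remove the mean so a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [e - est_mean for e in estimate]
    ref = [r - ref_mean for r in reference]
    # Project the estimate onto the reference to find the optimal scaling.
    alpha = sum(e * r for e, r in zip(est, ref)) / (sum(r * r for r in ref) + eps)
    target = [alpha * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))

# Toy example: a reference and an estimate whose error is orthogonal to it.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [1.1, -0.9, 0.9, -1.1]
score = si_sdr(estimate, reference)                        # about 20 dB
scaled_score = si_sdr([3.0 * e for e in estimate], reference)  # rescaling leaves it unchanged
```

The projection step is what makes the metric scale-invariant: multiplying the estimate by a constant gain changes the target and residual energies by the same factor, so the ratio stays the same.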
