Neural Audio Codecs for Prompt-Driven Universal Sound Separation

Adhiraj Banerjee, Vipul Arora

TL;DR

CodecSep targets edge-friendly, open-domain sound separation by performing prompt-guided masking directly on neural audio codec latents. The method freezes a DAC backbone and applies a FiLM-conditioned Transformer masker, guided by CLAP text embeddings, that selectively passes codec latents rather than generating new content. It achieves higher SI-SDR than spectrogram-based baselines and competitive perceptual quality while substantially reducing compute, memory, and latency in code-stream deployments. The approach also generalizes across open-domain datasets and leaves room to extend prompting beyond text to other modalities, making on-device universal sound separation more practical at scale.

Abstract

Text-guided sound separation supports flexible audio editing across media and assistive applications, but existing models like AudioSep are too compute-heavy for edge deployment. Neural audio codec (NAC) models such as CodecFormer and SDCodec are compute-efficient but limited to fixed-class separation. We introduce CodecSep, the first NAC-based model for on-device universal, text-driven separation. CodecSep combines DAC compression with a Transformer masker modulated by CLAP-derived FiLM parameters. Across six open-domain benchmarks under matched training/prompt protocols, CodecSep surpasses AudioSep in separation fidelity (SI-SDR) while remaining competitive in perceptual quality (ViSQOL) and matching or exceeding fixed-stem baselines (TDANet, CodecFormer, SDCodec). In code-stream deployments, it needs just 1.35 GMACs end-to-end, approximately $54\times$ less compute ($25\times$ architecture-only) than spectrogram-domain separators like AudioSep, while remaining fully bitstream-compatible.
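
To make the conditioning idea concrete, the following is a minimal sketch of FiLM-style modulation followed by masking of codec latents. It is not the authors' implementation. The function names, the single linear projection from the text embedding to the FiLM parameters, the sigmoid mask, and the toy dimensions are all assumptions made for illustration. The abstract states only that a Transformer masker is modulated by CLAP-derived FiLM parameters.

```python
import math

def linear(x, weight, bias):
    """Dense layer: one output per weight row (hypothetical projection)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def film(hidden, gamma, beta):
    """FiLM: scale and shift every channel of every frame."""
    return [[g * h + b for h, g, b in zip(frame, gamma, beta)]
            for frame in hidden]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def masked_latents(latents, hidden, text_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Assumed flow: text embedding -> (gamma, beta) -> FiLM -> mask -> latents.

    latents, hidden: lists of frames, each a list of channel values.
    text_emb: prompt embedding (for example from a text encoder such as CLAP).
    """
    gamma = linear(text_emb, w_gamma, b_gamma)
    beta = linear(text_emb, w_beta, b_beta)
    modulated = film(hidden, gamma, beta)
    # A mask in [0, 1] can only attenuate latents, never synthesize new ones.
    mask = [[sigmoid(v) for v in frame] for frame in modulated]
    return [[m * z for m, z in zip(mask_frame, latent_frame)]
            for mask_frame, latent_frame in zip(mask, latents)]

# Toy usage: the prompt pushes channel 0 open and channel 1 closed.
out = masked_latents(
    latents=[[2.0, 3.0]],
    hidden=[[0.5, 0.5]],
    text_emb=[1.0, 0.0],
    w_gamma=[[0.0, 0.0], [0.0, 0.0]], b_gamma=[0.0, 0.0],
    w_beta=[[100.0, 0.0], [-100.0, 0.0]], b_beta=[0.0, 0.0],
)
```

In this toy setup the first latent channel passes almost unchanged and the second is suppressed, which mirrors the "selectively pass, do not generate" behavior described in the TL;DR.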

Paper Structure

This paper contains 55 sections, 25 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: An overview of CodecSep. (Left) The full pipeline for text-guided USS. (Right) The integration of text conditioning into intermediate layers of the Transformer masker via FiLM layers.
  • Figure 2: Typical edge-server deployment comparing compute requirements of conventional audio-stream separators (audio in $\to$ codes out) versus CodecSep discrete inference (codes in $\to$ codes out).
  • Figure 3: Evaluation workflow for dnr-v2. Each mixture contains multi-source stems: speech (often multi-speaker), music (multi-instrument), and SFX ($\geq3$ overlapping events). Fixed-stem baselines predict a fixed set of outputs (e.g., 3 stems), whereas CodecSep and other text-guided models generate only the prompted source. Speech and music are evaluated using generic prompts, while SFX uses long-form compositional prompts listing all SFX events in each mixture. Extracted signals are compared with ground-truth category stems using SI-SDR and ViSQOL.
  • Figure 4: Evaluation workflow for the standardized three-source benchmarks (AudioCaps, ESC-50, Clotho, VGGSound, and AudioSet-eval). Following prior USS protocols, each mixture is constructed by combining three isolated events drawn from distinct classes. For each class, the corresponding textual prompt is supplied to the separator (e.g., “dog barking,” “gun shot,” “motor vehicle”), and the extracted signal is compared with the ground-truth isolated source using SI-SDR and ViSQOL.
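
Both evaluation workflows compare the extracted signal with a ground-truth source using SI-SDR. As a reference point, here is a minimal sketch of the standard scale-invariant SDR definition in plain Python. The function name, the small `eps` stabilizer, and the omission of mean removal are assumptions of this sketch, and the paper's evaluation code may differ in such details.

```python
import math

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB.

    The reference is rescaled by the optimal gain before the residual is
    measured, so multiplying the estimate by a constant does not change
    the score.
    """
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / (ref_energy + eps)            # optimal gain on the reference
    target = [scale * r for r in reference]     # projection of estimate on reference
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))

# Toy usage: the reference plus a small error that is orthogonal to it.
reference = [1.0, 0.0, 1.0, 0.0]
estimate = [1.0, 0.1, 1.0, -0.1]
score = si_sdr(estimate, reference)
rescaled = si_sdr([3.0 * e for e in estimate], reference)
```

Here the target-to-residual energy ratio is 100, so `score` is about 20 dB, and `rescaled` matches it because the metric ignores overall gain.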