SAM Audio: Segment Anything in Audio

Bowen Shi, Andros Tjandra, John Hoffman, Helin Wang, Yi-Chiao Wu, Luya Gao, Julius Richter, Matt Le, Apoorv Vyas, Sanyuan Chen, Christoph Feichtenhofer, Piotr Dollár, Wei-Ning Hsu, Ann Lee

TL;DR

SAM Audio introduces a general-purpose audio separation model that unifies text, visual, and span prompting within a diffusion-transformer framework trained with flow matching in a DAC-VAE latent space. It demonstrates state-of-the-art performance across speech, music, instrument, and general sound separation, leveraging large-scale real, synthetic, and pseudo-labeled data, plus a novel data engine for prompts. The paper also contributes SAM Audio-Bench, a real-world, multimodal evaluation suite, and SAM Audio Judge (SAJ), a reference-free perceptual metric that correlates highly with human judgments. Together, these contributions support scalable, open-domain audio separation with flexible user prompting and robust evaluation, with applications in multimodal AI systems and audio engineering workflows.

Abstract

General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or limited in controllability, supporting only a single prompting modality such as text. In this work, we present SAM Audio, a foundation model for general audio separation that unifies text, visual, and temporal span prompting within a single framework. Built on a diffusion transformer architecture, SAM Audio is trained with flow matching on large-scale audio data spanning speech, music, and general sounds, and can flexibly separate target sources described by language, visual masks, or temporal spans. The model achieves state-of-the-art performance across a diverse suite of benchmarks, including general sound, speech, music, and musical instrument separation in both in-the-wild and professionally produced audio, substantially outperforming prior general-purpose and specialized systems. Furthermore, we introduce a new real-world separation benchmark with human-labeled multimodal prompts and a reference-free evaluation model that correlates strongly with human judgment.
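
To make "trained with flow matching" concrete, the following is a minimal sketch of a flow-matching regression target on plain latent vectors. It assumes the common straight-line path between Gaussian noise and data; the abstract does not state which path SAM Audio uses, and the sketch omits the conditioning on the mixture and prompts. The names `linear_path`, `target_velocity`, `flow_matching_loss`, and `predict_velocity` are hypothetical and are not taken from the paper.

```python
import random


def linear_path(x0, x1, t):
    """Point at time t on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Time derivative of the straight-line path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, rng):
    """Single-sample estimate of the flow-matching regression loss.

    predict_velocity(x_t, t) stands in for the network (assumption: an
    unconditional model; a separation model would also receive the mixture
    and the prompts as inputs).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # noise sample
    t = rng.random()                          # time drawn uniformly from [0, 1)
    x_t = linear_path(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)


if __name__ == "__main__":
    rng = random.Random(0)
    latent = [0.3, -1.2, 0.8]                 # stand-in for one latent frame
    zero_model = lambda x_t, t: [0.0] * len(x_t)
    print(flow_matching_loss(zero_model, latent, rng))
```

At inference time, a model trained this way would be used by integrating the predicted velocity from noise at t=0 to t=1, for example with a few Euler steps.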

Paper Structure

This paper contains 88 sections, 5 equations, 19 figures, 23 tables.

Figures (19)

  • Figure 1: Overview of SAM Audio. Given an audio mixture, SAM Audio separates it into target and residual stems, conditioned on any combination of text descriptions (text prompts), visual masks (visual prompts), and temporal intervals (span prompts).
  • Figure 2: Illustration of our pseudo-labeling data synthesis pipeline. PLM-Audio generates text prompts from mixtures, which guide SAM Audio to produce target/residual stems. A filter stage retains only high-quality pseudo-labeled stems.
  • Figure 3: Illustration of pseudo-labeled visual data. The pseudo-labeling pipeline produces a text caption of the target audio, which is used to prompt SAM3 to obtain the visual mask.
  • Figure 4: Illustration of span generation. RMS energy (top) and Mel-spectrograms (bottom) for the target $x_{\mathrm{tgt}}$, residual $x_{\mathrm{res}}$, and mixture $x_{\mathrm{mix}}$. Yellow intervals denote detected spans corresponding to active sound events.
  • Figure 5: Summary of task, modality, and dataset coverage in SAM Audio-Bench. The modality abbreviations are as follows: "T" indicates that the item can be used with a text-only prompt (e.g., for speaker separation this implies that the text description can be unambiguously associated with a single speaker), "V" indicates that the target sound is on-screen and that a SAM masklet is provided, and "S" indicates that event boundaries are available for the target sound.
  • ...and 14 more figures
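
The span generation illustrated in Figure 4 can be approximated by thresholding frame-level RMS energy. The sketch below is one plausible reading of the caption, not the paper's procedure: the function names (`frame_rms`, `detect_spans`), the frame and hop sizes, the threshold, and the minimum duration are all assumptions chosen for illustration.

```python
import math


def frame_rms(samples, frame_len, hop):
    """RMS energy of each full frame of a mono signal (list of floats)."""
    if len(samples) < frame_len:
        return []
    rms = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return rms


def detect_spans(samples, sample_rate, frame_len=400, hop=200,
                 threshold=0.05, min_dur=0.05):
    """Return (start_sec, end_sec) intervals whose frame RMS exceeds threshold."""
    rms = frame_rms(samples, frame_len, hop)
    runs, start = [], None
    for i, energy in enumerate(rms):
        active = energy > threshold
        if active and start is None:
            start = i                      # a run of active frames begins
        elif not active and start is not None:
            runs.append((start, i))        # run ends just before frame i
            start = None
    if start is not None:
        runs.append((start, len(rms)))     # run extends to the last frame

    spans = []
    for first, end in runs:
        t0 = first * hop / sample_rate
        t1 = ((end - 1) * hop + frame_len) / sample_rate
        if t1 - t0 >= min_dur:             # drop very short blips
            spans.append((t0, t1))
    return spans


if __name__ == "__main__":
    sr = 1000
    # One second of silence, one second of constant signal, one second of silence.
    signal = [0.0] * sr + [0.5] * sr + [0.0] * sr
    print(detect_spans(signal, sr, frame_len=100, hop=100))
```

Applied to the target stem, the resulting intervals could serve as span prompts for the corresponding mixture. A practical pipeline would likely also merge spans separated by short gaps.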