
Cinematic Audio Source Separation Using Visual Cues

Kang Zhang, Suyeon Lee, Arda Senocak, Joon Son Chung

Abstract

Cinematic Audio Source Separation (CASS) aims to decompose mixed film audio into speech, music, and sound effects, enabling applications like dubbing and remastering. Existing CASS approaches are audio-only, overlooking the inherent audio-visual nature of films, where sounds often align with visual cues. We present the first framework for audio-visual CASS (AV-CASS), leveraging visual context to enhance separation quality. Our method formulates CASS as a conditional generative modeling problem using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream setup. Trained entirely on synthetic data, our model generalizes effectively to real-world cinematic content and achieves strong performance on synthetic, real-world, and audio-only CASS benchmarks. Code and demo are available at \url{https://cass-flowmatching.github.io}.
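The abstract formulates separation as conditional generative modeling with conditional flow matching. The paper's architecture is not reproduced here, but the standard flow matching training objective it builds on can be sketched as a toy example: interpolate between noise and the target stem, and regress a velocity predictor (here a placeholder callable standing in for the paper's conditioned vector field estimator $\bm{u}_\theta$) onto the straight-line velocity. The function and argument names are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_step(target_stem, mixture_cond, visual_cond, predict_velocity):
    """One conditional flow matching training step (toy sketch).

    target_stem      : clean source (e.g. a speech stem), array of any shape
    mixture_cond     : mixture-audio conditioning (passed through untouched)
    visual_cond      : visual conditioning (passed through untouched)
    predict_velocity : callable (x_t, t, mixture_cond, visual_cond) -> array,
                       a stand-in for the conditioned vector field estimator
    """
    x1 = target_stem
    x0 = rng.standard_normal(x1.shape)   # noise sample
    t = rng.uniform()                    # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1           # point on the linear probability path
    u = x1 - x0                          # target velocity along that path
    v = predict_velocity(xt, t, mixture_cond, visual_cond)
    return float(np.mean((v - u) ** 2))  # flow matching regression loss
```

In a real training loop the placeholder predictor would be a neural network whose parameters are updated by this squared-error loss; at inference, separation would be performed by integrating the learned velocity field from noise, conditioned on the mixture and visual features.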

Paper Structure

This paper contains 44 sections, 12 equations, 17 figures, 9 tables.

Figures (17)

  • Figure 1: Illustration of the Cinematic Audio Source Separation (CASS) task. The audio stream from a movie is separated into distinct tracks: speech, sound effects, and music.
  • Figure 2: Architecture of AV-CASS. The fusion module integrates visual features from the facial and scene encoders into $\bm{c}^V$, which serves as a conditioning input along with a mixture audio $\bm{s}^A$ for the vector field estimator $\bm{u}_\theta$.
  • Figure 3: Extraction of dual-stream visual inputs from a real-world cinematic video during inference. Since no architectural changes are required, the AV-CASS model can be used with real-world cinematic videos for inference.
  • Figure 4: Comparison of MRX, BandIt, and AV-CASS on a real-world movie sample. Input video frames $\bm{v}^f$ and $\bm{v}^s$ are shown at the top, with the input audio spectrogram $\bm{s}^A$ placed for each stem. Yellow boxes highlight the bicycle bell, red boxes indicate cheering, and dotted boxes show elements misplaced in non-target stems. The dotted pink box in BandIt’s FX shows speech artifacts. Best viewed when zoomed in. This sample can also be viewed in the supplementary video.
  • Figure 5: Comparison of our audio-only model (Ours-AO), DAVIS-Flow [huang2025davisflow], and our audio-visual model (AV-CASS) on a clip from the AVDnR test set. The input video frames and the GT audio spectrograms are shown at the top. Yellow boxes highlight the bird chirping present in $\bm{v}^s$ and the FX tracks. Dotted boxes indicate misplaced segments. Best viewed when zoomed in.
  • ...and 12 more figures