High-Quality Visually-Guided Sound Separation from Diverse Categories

Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu

TL;DR

DAVIS outperforms existing state-of-the-art discriminative audio-visual separation methods in separation quality on the AVE and MUSIC datasets, demonstrating the advantages of the framework for tackling the audio-visual source separation task.

Abstract

We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression problem, achieving significant progress. However, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated sounds directly from Gaussian noise, conditioned on both the audio mixture and the visual information. With its generative objective, DAVIS is better suited to achieving the goal of high-quality sound separation across diverse sound categories. We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets, and results show that DAVIS outperforms other methods in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task.
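
The abstract describes synthesizing the separated sound from Gaussian noise, conditioned on the audio mixture and the visual information. A minimal sketch of that idea is given below, assuming standard DDPM ancestral sampling with a linear noise schedule; the abstract does not state the sampler or schedule actually used. The names `make_schedule`, `toy_noise_predictor`, and `sample_separated` are hypothetical, and the toy predictor is only a placeholder for the trained Separation U-Net.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption; not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_noise_predictor(x_t, x_mix, v, t):
    """Hypothetical stand-in for the Separation U-Net.

    A real model would be a trained network that uses x_mix, v, and t.
    This placeholder ignores v and t so that the loop below can run.
    """
    return 0.1 * (x_t - x_mix)


def sample_separated(x_mix, v, eps_model, num_steps=50, seed=0):
    """Iteratively denoise x_T ~ N(0, I) into x_0, conditioned on (x_mix, v)."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    # Start from pure Gaussian noise with the same shape as the mixture.
    x = rng.standard_normal(x_mix.shape)
    for t in range(num_steps - 1, -1, -1):
        eps_hat = eps_model(x, x_mix, v, t)
        # Posterior mean of the DDPM reverse step.
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Add noise at every step except the last one.
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        else:
            x = mean
    return x


if __name__ == "__main__":
    x_mix = np.ones((4, 8))   # toy "mixture spectrogram" (freq x time)
    v = np.zeros(16)          # toy visual feature vector
    x0 = sample_separated(x_mix, v, toy_noise_predictor, num_steps=10)
    print(x0.shape)
```

The conditioning signals enter only through the noise predictor, so a different network can be swapped in without touching the sampling loop.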

Paper Structure (14 sections, 5 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Separation results on diverse time-frequency structures are shown for SOTA discriminative methods and our proposed DAVIS. Each row displays the audio mixture, reference visual frame, ground truth magnitude, and predicted magnitudes from DAVIS, iQuery [chen2023iquery], and CCoL [tian2021cyclic]. DAVIS successfully recovers suppressed time-frequency structures (highlighted in the box), where mask-regression methods fail.
  • Figure 2: Overview of the DAVIS framework. We aim to synthesize $x_0$ from the mixture $x^{mix}$, visual stream $v$, and timestep $t$. Starting with $x_T$ from a standard distribution, we encode $v=\{I_j\}_{j=1}^K$ and $t$ into the embedding space. A temporal transformer generates the visual feature $\boldsymbol{v}$, which, along with $\boldsymbol{t}$, conditions the Separation U-Net $\epsilon_\theta$ to iteratively denoise $x_T$ into $x_0$. $\boldsymbol{v}$ is used only in the Feature Interaction Module for audio-visual association, while $\boldsymbol{t}$ is used throughout.
  • Figure 2: Ablation on CA block design. R, TF, and T denote ResNet, Time-Frequency, and Time Attention blocks, respectively. We highlight the setting used in this paper in gray.
  • Figure 3: Illustrations of (a) the CA block: It takes audio feature maps and a time embedding $\boldsymbol{t}$ as inputs. Each sub-block, except the up/down sampling layer, is conditioned on $\boldsymbol{t}$. ResNet and attention blocks are stacked to capture local and non-local audio contexts; (b) the Audio-Visual Feature Interaction Module: It replicates $\boldsymbol{v}$ and concatenates it with $\boldsymbol{f_a}$, then processes the concatenated features with two identical ResNet blocks and an attention block.
  • Figure 4: Visualizations of audio-visual separation results on the AVE (the top three mixtures) and MUSIC (the last mixture) datasets. Two sounds are mixed, and reference frames are provided to guide the separation. The comparison is shown between the predictions made by DAVIS (ours), iQuery [chen2023iquery], and CCoL [tian2021cyclic] with the ground truth. DAVIS can effectively separate sound mixtures from various categories, such as airplane, rats, and dog barking.
  • ...and 4 more figures
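
The Figure 3(b) caption describes replicating $\boldsymbol{v}$ and concatenating it with $\boldsymbol{f_a}$. A minimal sketch of that replicate-and-concatenate step follows. It assumes, for illustration only, that the audio feature map has shape (channels, frequency, time) and that the visual feature is a single channel vector; the captions do not give the actual tensor shapes. The function name `tile_and_concat` is hypothetical, and the subsequent ResNet and attention blocks are omitted.

```python
import numpy as np


def tile_and_concat(f_a, v):
    """Replicate v over the audio map's grid and concatenate along channels.

    f_a: audio feature map, assumed shape (C_a, F, T).
    v:   visual feature vector, assumed shape (C_v,).
    Returns an array of shape (C_a + C_v, F, T).
    """
    _, n_freq, n_time = f_a.shape
    # Broadcast the vector so every time-frequency position sees the same v.
    v_map = np.broadcast_to(v[:, None, None], (v.shape[0], n_freq, n_time))
    return np.concatenate([f_a, v_map], axis=0)


if __name__ == "__main__":
    f_a = np.zeros((3, 2, 5))
    v = np.arange(4.0)
    print(tile_and_concat(f_a, v).shape)
```

Replicating the visual vector lets each time-frequency position of the audio features be processed jointly with the same visual context.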