Continual Audio-Visual Sound Separation

Weiguo Pian, Yiyang Nan, Shijian Deng, Shentong Mo, Yunhui Guo, Yapeng Tian

TL;DR

ContAV-Sep presents a novel Cross-modal Similarity Distillation Constraint (CrossSDC) to uphold the cross-modal semantic similarity through incremental tasks and retain previously acquired knowledge of semantic similarity in old models, mitigating the risk of catastrophic forgetting.

Abstract

In this paper, we introduce a novel continual audio-visual sound separation task, aiming to continuously separate sound sources for new classes while preserving performance on previously learned classes, with the aid of visual guidance. This problem is crucial for practical visually guided auditory perception as it can significantly enhance the adaptability and robustness of audio-visual sound separation models, making them more applicable for real-world scenarios where encountering new sound sources is commonplace. The task is inherently challenging as our models must not only effectively utilize information from both modalities in current tasks but also preserve their cross-modal association in old tasks to mitigate catastrophic forgetting during audio-visual continual learning. To address these challenges, we propose a novel approach named ContAV-Sep (Continual Audio-Visual Sound Separation). ContAV-Sep presents a novel Cross-modal Similarity Distillation Constraint (CrossSDC) to uphold the cross-modal semantic similarity through incremental tasks and retain previously acquired knowledge of semantic similarity in old models, mitigating the risk of catastrophic forgetting. The CrossSDC can seamlessly integrate into the training process of different audio-visual sound separation frameworks. Experiments demonstrate that ContAV-Sep can effectively mitigate catastrophic forgetting and achieve significantly better performance compared to other continual learning baselines for audio-visual sound separation. Code is available at: https://github.com/weiguoPian/ContAV-Sep_NeurIPS2024.
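
As a rough illustration of what "retaining cross-modal similarity from an old model" can mean in practice, the sketch below penalizes drift between the audio-visual cosine-similarity matrix of the model being trained and that of a frozen old model. This is a generic similarity-preservation loss written under our own assumptions, not the CrossSDC formulation from the paper; the function names (`cross_modal_similarity`, `similarity_distillation_loss`), the feature shapes, and the squared-error penalty are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length so dot products become cosines.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def cross_modal_similarity(audio_feats, visual_feats):
    # Entry (i, j) is the cosine similarity between audio sample i and visual sample j.
    return l2_normalize(audio_feats) @ l2_normalize(visual_feats).T


def similarity_distillation_loss(audio_new, visual_new, audio_old, visual_old):
    # Assumption: a mean squared gap between the two similarity matrices stands in
    # for a distillation constraint; the paper's actual loss may differ.
    sim_new = cross_modal_similarity(audio_new, visual_new)
    sim_old = cross_modal_similarity(audio_old, visual_old)  # old model is frozen
    return float(np.mean((sim_new - sim_old) ** 2))


# Toy usage with random stand-in embeddings: 4 samples, 16-dimensional features.
rng = np.random.default_rng(0)
audio_old = rng.normal(size=(4, 16))
visual_old = rng.normal(size=(4, 16))

# Features that have not moved incur no penalty.
loss_same = similarity_distillation_loss(audio_old, visual_old, audio_old, visual_old)

# Features that drifted after training on a new task incur a positive penalty.
audio_drift = audio_old + rng.normal(size=(4, 16))
loss_drift = similarity_distillation_loss(audio_drift, visual_old, audio_old, visual_old)
```

In a training loop, a term of this kind would typically be added to the separation loss with a weighting coefficient, so that the new model stays free to fit new classes while being discouraged from rearranging the audio-visual relationships the old model had learned.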

Paper Structure

This paper contains 23 sections, 14 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Top: Illustration of the continual audio-visual sound separation task, where the model (separator) learns from sequential audio-visual sound separation tasks. Bottom: Illustration of the catastrophic forgetting problem in continual audio-visual sound separation and its mitigation by our proposed method. Fine-tuning: Directly fine-tune the separation model on new sound source classes; Upper bound: Train the model using all training data from seen sound source classes.
  • Figure 2: Overview of our proposed ContAV-Sep, which consists of an audio-visual sound separation base model architecture, an Output Mask Distillation, and our proposed Cross-modal Similarity Distillation Constraint. The fire icon denotes that the module is trainable, while the snowflake icon denotes that the module is frozen. (i)STFT stands for (inverse) Short-Time Fourier Transform. Note that the old model $\mathcal{F}_{\boldsymbol{\Theta}_{t-1}}$ is frozen during training.
  • Figure 3: Testing results of different continual learning methods with iQuery (Chen et al., 2023) on the metrics of (a) SDR, (b) SIR, and (c) SAR at each incremental step.
  • Figure 4: Testing results with different memory sizes (number of samples per class in the memory) on the metrics of (a) SDR, (b) SIR, and (c) SAR at each incremental step.
  • Figure 5: Left: a randomly selected sample with its frame and ground-truth spectrogram. Right: sounds separated by our ContAV-Sep and the baselines at each incremental step.
  • ...and 5 more figures