Selective Noise Suppression and Discriminative Mutual Interaction for Robust Audio-Visual Segmentation

Kai Peng, Yunzhe Shen, Miao Zhang, Leiye Liu, Yidong Han, Wei Ji, Jingjing Li, Yongri Piao, Huchuan Lu

Abstract

The ability to capture and segment sounding objects in dynamic visual scenes is crucial for the development of Audio-Visual Segmentation (AVS) tasks. While significant progress has been made in this area, the interaction between audio and visual modalities still requires further exploration. In this work, we aim to answer the following questions: How can a model effectively suppress audio noise while enhancing relevant audio information? How can we achieve discriminative interaction between the audio and visual modalities? To this end, we propose SDAVS, equipped with the Selective Noise-Resilient Processor (SNRP) module and the Discriminative Audio-Visual Mutual Fusion (DAMF) strategy. The proposed SNRP mitigates audio noise interference by selectively emphasizing relevant auditory cues, while DAMF ensures more consistent audio-visual representations. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on benchmark AVS datasets, especially in multi-source and complex scenes. The code and model are available at https://github.com/happylife-pk/SDAVS.

Paper Structure (19 sections, 9 equations, 10 figures, 12 tables)

Figures (10)

  • Figure 1: In the first scene (a), the piano is the actual sounding object. However, during the piano performance, the interaction between the cat and the person interferes with the audio signal. In the second scene (b), the guitar is the true sounding object, while the person remains silent. Existing methods struggle to distinguish the correct sounding object.
  • Figure 2: The overall pipeline of the proposed SDAVS, which includes the Selective Noise-Resilient Processor (SNRP) and Discriminative Audio-Visual Mutual Fusion (DAMF) modules. The SNRP module first filters noise and enhances audio-relevant features. The refined video features and the original audio features are then fed into the DAMF module, which aligns their perception regions and performs discriminative fusion using spatial-temporal-channel (STC) enhancement and bidirectional cross-modal attention. In DAMF, different icon shapes represent the distinct perception regions of each modality. After DAMF is applied, the icons become more consistent, indicating improved cross-modal alignment.
  • Figure 3: Qualitative comparison of AVSBench [zhou2022audio], AVSegFormer [gao2024avsegformer], AVSStone [ma2024stepping], and ours on the AVSBench-object dataset. (a) shows the accuracy of the model's segmentation, and (b) shows the model's suppression of incorrect segmentation, i.e., the segmentation of non-sounding objects.
  • Figure 4: Qualitative comparison of AVSBench [zhou2022audio], AVSegFormer [gao2024avsegformer], and ours on the AVSS dataset, illustrating our model's ability to interpret semantic information.
  • Figure 5: Feature map visualizations in SNRP, presented in sequence from left to right: (a) raw images, (b) ground truth masks, (c) feature maps before processing by SNRP, (d) feature maps after processing by CFS, and (e) feature maps after processing by SFS.
  • ...and 5 more figures
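
The Figure 2 caption mentions bidirectional cross-modal attention inside DAMF. As a rough illustration of that general idea only, the sketch below lets visual tokens attend to audio tokens and audio tokens attend to visual tokens, then adds each result back as a residual. This is not the authors' DAMF: it omits the STC enhancement, learned projections, and multi-head structure, and every name in it (`cross_attend`, `bidirectional_fusion`, the toy token values) is a hypothetical choice made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention of each query over keys_values.

    Assumption: no learned projections, so keys and values are the
    same vectors. A real model would project them first.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        out.append(
            [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)]
        )
    return out


def bidirectional_fusion(visual_tokens, audio_tokens):
    """Hypothetical fusion: each modality gets a residual update from the other."""
    v_from_a = cross_attend(visual_tokens, audio_tokens)  # visual queries audio
    a_from_v = cross_attend(audio_tokens, visual_tokens)  # audio queries visual
    fused_v = [[x + y for x, y in zip(v, u)] for v, u in zip(visual_tokens, v_from_a)]
    fused_a = [[x + y for x, y in zip(a, u)] for a, u in zip(audio_tokens, a_from_v)]
    return fused_v, fused_a


# Toy inputs: three 2-d visual tokens and one 2-d audio token.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[1.0, 0.0]]
fused_visual, fused_audio = bidirectional_fusion(visual, audio)
```

With a single audio token, every visual token receives that token unchanged as its residual, while the audio token receives a convex combination of the visual tokens weighted by similarity. The actual module and its training code are in the repository linked in the abstract.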