Extending Segment Anything Model into Auditory and Temporal Dimensions for Audio-Visual Segmentation

Juhyeong Seon, Woobin Im, Sebin Lee, Jumin Lee, Sung-Eui Yoon

TL;DR

This work tackles Audio-Visual Segmentation (AVS) by extending the Segment Anything Model (SAM) to utilize temporal and auditory context. It introduces ST-BAVA, a Spatio-Temporal, Bidirectional Audio-Visual Attention module placed between SAM's image encoder and mask decoder, and complements it with Adapters to inject audio into the image pathway. The approach yields state-of-the-art AVS performance on AVSBench, notably achieving an 8.3% mIoU improvement on the challenging MS3 multi-source subset while keeping trainable parameters to less than 4% of SAM. This demonstrates effective cross-modal, spatio-temporal fusion in a foundation-model-based segmentation framework and suggests broader applicability for dense prediction tasks involving audio-visual data.
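
The Adapter idea mentioned above can be illustrated with a small residual bottleneck layer that adds a projected audio feature to frozen image tokens. This is only a sketch under stated assumptions: the class name `AudioAdapter`, the ReLU bottleneck, the zero-initialised up-projection, and all dimensions are hypothetical and are not taken from the paper.

```python
import numpy as np

class AudioAdapter:
    """Hypothetical bottleneck adapter; an illustration, not the paper's layer."""

    def __init__(self, dim, audio_dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        # Small trainable matrices; the surrounding encoder stays frozen.
        self.audio_proj = rng.normal(0.0, 0.02, (audio_dim, dim))
        self.down = rng.normal(0.0, 0.02, (dim, bottleneck))
        # Assumption: zero init, so the adapter is an identity map at the start.
        self.up = np.zeros((bottleneck, dim))

    def __call__(self, tokens, audio):
        # tokens: (N, dim) image tokens; audio: (audio_dim,) one audio feature.
        fused = tokens + audio @ self.audio_proj      # broadcast audio to every token
        hidden = np.maximum(fused @ self.down, 0.0)   # ReLU bottleneck
        return tokens + hidden @ self.up              # residual around the adapter

    def num_params(self):
        return self.audio_proj.size + self.down.size + self.up.size

adapter = AudioAdapter(dim=8, audio_dim=4, bottleneck=2)
tokens = np.ones((3, 8))
out = adapter(tokens, audio=np.ones(4))
print(out.shape, adapter.num_params())
```

Because only the adapter matrices would be trained in such a setup, the trainable parameter count grows with the bottleneck width rather than with the size of the frozen backbone.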

Abstract

Audio-visual segmentation (AVS) aims to segment sound sources in a video sequence, requiring a pixel-level understanding of audio-visual correspondence. As the Segment Anything Model (SAM) has strongly influenced a wide range of dense prediction problems, prior works have investigated introducing SAM into AVS with audio as a new prompt modality. Nevertheless, constrained by SAM's single-frame segmentation scheme, the temporal context across multiple frames of audio-visual data remains insufficiently utilized. To this end, we study the extension of SAM's capabilities to sequences of audio-visual scenes by analyzing contextual cross-modal relationships across frames. To achieve this, we propose a Spatio-Temporal, Bidirectional Audio-Visual Attention (ST-BAVA) module inserted between SAM's image encoder and mask decoder. It adaptively updates the audio-visual features to convey the spatio-temporal correspondence between the video frames and audio streams. Extensive experiments demonstrate that our proposed model outperforms state-of-the-art methods on AVS benchmarks, most notably with an 8.3% mIoU gain on a challenging multi-source subset.
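
For readers unfamiliar with the metric, mIoU conventionally denotes the mean intersection-over-union between predicted and ground-truth masks. The snippet below is a minimal sketch of that computation for binary masks; the helper names `mask_iou` and `mean_iou`, the per-frame averaging, and the treatment of two empty masks as a perfect match are assumptions, since the benchmark's exact protocol is not described in this summary.

```python
import numpy as np

def mask_iou(pred, target):
    """IoU between two binary masks of identical shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)

def mean_iou(preds, targets):
    """Average the per-frame IoU over a sequence of masks."""
    return float(np.mean([mask_iou(p, t) for p, t in zip(preds, targets)]))

# Two toy 2x2 frames: one half-correct prediction, one silent (empty) frame.
preds = [[[1, 1], [0, 0]], [[0, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[0, 0], [0, 0]]]
print(mean_iou(preds, targets))  # 0.75
```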

Paper Structure (23 sections, 3 equations, 5 figures, 3 tables)

This paper contains 23 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Segmentation results of different models on the film A Dog's Purpose (2017). (a) The Segment Anything Model (SAM) segments target objects in the image, with their regions guided by user prompts. (b) Prior works [liu2023annotation, wang2023prompting] have adapted SAM to segment sounding objects with a corresponding audio prompt per frame. (c) We propose a spatio-temporal, bidirectional audio-visual attention (ST-BAVA) module, enabling SAM to fully leverage the relationships between successive video frames and audio streams in a bidirectional way. In (c), our model successfully segments the human and the dog on the frames where they make sounds.
  • Figure 2: Overview of the proposed SAM with ST-BAVA. (a) Our model takes a sequence of video frames and audio streams as input and predicts the mask of the sound sources for each video frame. (b) The ST-BAVA module bidirectionally updates the image and audio features with spatial and then temporal attention. M.H. stands for multi-head. The initial audio feature from the audio backbone is used as a positional encoding for the audio feature.
  • Figure 3: Effect of temporal attention in ST-BAVA on the audio-visual segmentation results. Our model leverages the temporal relationship across multiple frames, leading to accurate sound source predictions. Wrong predictions made without temporal attention are marked with red boxes.
  • Figure 4: Qualitative comparison with existing methods. Our method accurately identifies sound sources across multiple frames and delineates detailed object shapes, achieving the most accurate segmentation among the compared methods.
  • Figure 5: Spatial attention maps of the audio and visual embedding in the middle of our model pipeline. The attention map before ST-BAVA is calculated with the features extracted from the backbones. After the ST-BAVA, the map separately represents the region of sound sources within the frames, which leads to the correct segmentation of the sources in the predicted mask. Green-boxed regions show the visual information aggregated from other frames by temporal attention (the man with multiple arms).
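
The Figure 2 caption describes ST-BAVA as a bidirectional update of image and audio features, with spatial attention followed by temporal attention. The NumPy sketch below illustrates only that ordering, under loud assumptions: attention is single-head with no learned projections, the names `attend` and `st_bava_sketch` and the tensor layouts are hypothetical, and adding the initial audio feature to the audio queries and keys is one guess at how a positional encoding could be applied. It is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, key, value):
    """Scaled dot-product attention over the second-to-last axis of key/value."""
    scores = query @ np.swapaxes(key, -1, -2) / np.sqrt(query.shape[-1])
    return softmax(scores, axis=-1) @ value

def st_bava_sketch(image, audio, audio_init):
    """image: (T, N, D) tokens per frame; audio, audio_init: (T, M, D)."""
    # Assumption: the initial audio feature acts as an additive positional encoding.
    audio_pe = audio + audio_init
    # Spatial step, per frame, in both directions (residual updates).
    audio_s = audio + attend(audio_pe, image, image)   # audio attends to image tokens
    image_s = image + attend(image, audio_pe, audio)   # image tokens attend to audio
    # Temporal step: each token position attends over the T frames.
    image_t = np.swapaxes(image_s, 0, 1)               # (N, T, D)
    audio_t = np.swapaxes(audio_s, 0, 1)               # (M, T, D)
    image_t = image_t + attend(image_t, image_t, image_t)
    audio_t = audio_t + attend(audio_t, audio_t, audio_t)
    return np.swapaxes(image_t, 0, 1), np.swapaxes(audio_t, 0, 1)

rng = np.random.default_rng(0)
T, N, M, D = 5, 16, 2, 8   # frames, image tokens, audio tokens, channels (all made up)
image = rng.normal(size=(T, N, D))
audio = rng.normal(size=(T, M, D))
new_image, new_audio = st_bava_sketch(image, audio, audio_init=audio)
print(new_image.shape, new_audio.shape)  # (5, 16, 8) (5, 2, 8)
```

In this sketch the temporal step is what lets a frame's features depend on other frames, which matches the role the captions of Figures 3 and 5 attribute to temporal attention.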