Extending Segment Anything Model into Auditory and Temporal Dimensions for Audio-Visual Segmentation
Juhyeong Seon, Woobin Im, Sebin Lee, Jumin Lee, Sung-Eui Yoon
TL;DR
This work tackles Audio-Visual Segmentation (AVS) by extending the Segment Anything Model (SAM) to use temporal and auditory context. It introduces ST-BAVA, a Spatio-Temporal, Bidirectional Audio-Visual Attention module placed between SAM's image encoder and mask decoder, and complements it with Adapters that inject audio into the image pathway. The approach achieves state-of-the-art AVS performance on AVSBench, including an 8.3% mIoU improvement on the challenging multi-source MS3 subset, while keeping trainable parameters to less than 4% of SAM's parameters. This demonstrates effective cross-modal, spatio-temporal fusion in a foundation-model-based segmentation framework and suggests broader applicability to dense prediction tasks involving audio-visual data.
Abstract
Audio-visual segmentation (AVS) aims to segment sound sources in a video sequence, which requires a pixel-level understanding of audio-visual correspondence. As the Segment Anything Model (SAM) has had a strong impact on a wide range of dense prediction problems, prior works have explored applying SAM to AVS by using audio as a new prompt modality. Nevertheless, constrained by SAM's single-frame segmentation scheme, these approaches leave the temporal context across multiple frames of audio-visual data insufficiently utilized. We therefore study how to extend SAM's capabilities to sequences of audio-visual scenes by analyzing contextual cross-modal relationships across frames. Specifically, we propose a Spatio-Temporal, Bidirectional Audio-Visual Attention (ST-BAVA) module inserted between SAM's image encoder and mask decoder. It adaptively updates the audio and visual features to convey the spatio-temporal correspondence between the video frames and the audio stream. Extensive experiments demonstrate that our proposed model outperforms state-of-the-art methods on AVS benchmarks, most notably with an 8.3% mIoU gain on a challenging multi-source subset.
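To make the idea of bidirectional audio-visual attention concrete, the following is a minimal sketch, not the ST-BAVA implementation. It assumes single-head dot-product attention with no learned projections, residual updates, and visual tokens flattened over all frames and spatial positions so that each audio token can attend across time. The function names `cross_attend` and `bidirectional_av_attention` and the toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Update each query with an attention-weighted mix of keys_values.

    Assumption: keys and values are the same vectors, and there are no
    learned projection matrices (a real module would have them).
    """
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended = [
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ]
        # Residual update: keep the original feature and add the context.
        updated.append([qi + ai for qi, ai in zip(q, attended)])
    return updated


def bidirectional_av_attention(visual_tokens, audio_tokens):
    """Both directions read the ORIGINAL features (an assumed design choice).

    visual_tokens: T*H*W vectors (all frames, all positions, flattened)
    audio_tokens:  T vectors (one per frame)
    """
    new_visual = cross_attend(visual_tokens, audio_tokens)  # audio -> visual
    new_audio = cross_attend(audio_tokens, visual_tokens)   # visual -> audio
    return new_visual, new_audio


# Toy input: 2 frames x 2 patches of 2-dim visual features, 2 audio tokens.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
audio = [[1.0, 0.0], [0.0, 1.0]]
new_v, new_a = bidirectional_av_attention(visual, audio)
```

In this sketch the updated visual tokens would be passed on to the mask decoder, and the updated audio tokens would serve as the prompt; how the actual module structures its spatial and temporal attention is not specified in the abstract.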
