AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation
Sitong Gong, Yunzhi Zhuge, Lu Zhang, Yifan Wang, Pingping Zhang, Lijun Wang, Huchuan Lu
TL;DR
AVS-Mamba tackles audio-visual segmentation (AVS) by modeling long-range temporal dependencies with a selective state space model, whose cost grows linearly with sequence length rather than quadratically as in Transformers. The framework introduces a Multi-scale Temporal Encoder (MTE) built from a Visual State Space (VSS) Block and a Temporal Mamba Block, a Modality Aggregation Decoder (MAD) that applies Vision-to-Audio fusion at both the frame and temporal levels, and a Contextual Integration Pyramid (CIP) for cross-frame audio-to-vision integration. It reports state-of-the-art results on the AVSBench-object and AVSBench-semantic datasets, supported by ablations and qualitative analyses. The work provides a scalable cross-modal approach to AVS that improves cross-scale temporal coherence and audio-guided visual localization, with open-source code at AVS-Mamba.
Abstract
The essence of audio-visual segmentation (AVS) lies in locating and delineating sound-emitting objects within a video stream. While Transformer-based methods have shown promise, they struggle to model long-range dependencies because of their quadratic computational cost, which becomes a bottleneck in complex scenarios. To overcome this limitation and enable complex multi-modal comprehension with linear complexity, we introduce AVS-Mamba, a selective state space model for the AVS task. Our framework incorporates two key components for video understanding and cross-modal learning: the Temporal Mamba Block for sequential video processing and the Vision-to-Audio Fusion Block for audio-vision integration. Building on these, we develop the Multi-scale Temporal Encoder, which enhances the learning of visual features across scales and facilitates the perception of intra- and inter-frame information. To perform multi-modal fusion, we propose the Modality Aggregation Decoder, which uses the Vision-to-Audio Fusion Block to integrate visual features into audio features at both the frame and temporal levels. Further, we adopt the Contextual Integration Pyramid to perform audio-to-vision spatial-temporal context collaboration. With these components, our approach achieves new state-of-the-art results on the AVSBench-object and AVSBench-semantic datasets. Our source code and model weights are available at AVS-Mamba.
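To make the linear-complexity claim concrete, the following is a minimal sketch of a generic selective state space scan in the style of Mamba. It is not the paper's Temporal Mamba Block or Vision-to-Audio Fusion Block. It assumes a single channel, a diagonal state matrix, and a simplified discretization. The function name and the parameters `a`, `w_delta`, `w_b`, and `w_c` are hypothetical, introduced only for illustration.

```python
import math


def softplus(v):
    # Numerically stable softplus; keeps the step size positive.
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def selective_scan(xs, a, w_delta, w_b, w_c):
    """Single-channel selective SSM scan (illustrative, not the paper's code).

    xs      : list of floats, the input sequence (e.g. flattened video tokens)
    a       : list of N negative floats, the diagonal of the state matrix
    w_delta : float, hypothetical weight mapping the input to a step size
    w_b, w_c: lists of N floats, hypothetical weights that make B_t and C_t
              depend on the current input (the "selective" part)
    """
    n = len(a)
    h = [0.0] * n  # hidden state carried across time steps
    ys = []
    for x in xs:  # a single pass over the sequence: O(L * N) work
        delta = softplus(w_delta * x)  # input-dependent step size
        y = 0.0
        for i in range(n):
            a_bar = math.exp(delta * a[i])  # discretized state transition
            b_bar = delta * (w_b[i] * x)    # simplified discretization of B_t
            h[i] = a_bar * h[i] + b_bar * x
            y += (w_c[i] * x) * h[i]        # readout with input-dependent C_t
        ys.append(y)
    return ys


if __name__ == "__main__":
    out = selective_scan([1.0, 0.5, -0.2, 0.0], [-1.0, -2.0], 0.5,
                         [1.0, 0.5], [1.0, 1.0])
    print(len(out))
```

Each output depends only on the current input and a fixed-size hidden state, so doubling the sequence length doubles the work. Self-attention, by contrast, compares every pair of tokens, which is the source of the quadratic cost mentioned above.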
