AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation

Sitong Gong, Yunzhi Zhuge, Lu Zhang, Yifan Wang, Pingping Zhang, Lijun Wang, Huchuan Lu

TL;DR

AVS-Mamba tackles audio-visual segmentation (AVS) by modeling long-range temporal dependencies with a selective state space model, whose cost grows linearly with sequence length rather than quadratically as in Transformers. The framework introduces a Multi-scale Temporal Encoder (MTE) built from a Visual State Space (VSS) Block and a Temporal Mamba Block, a Modality Aggregation Decoder (MAD) that performs Vision-to-Audio fusion at the frame and temporal levels, and a Contextual Integration Pyramid (CIP) for cross-frame audio-to-vision integration. It reports state-of-the-art results on the AVSBench-object and AVSBench-semantic datasets, supported by ablations and qualitative analyses. The work offers a scalable cross-modal approach to AVS that improves cross-scale temporal coherence and audio-guided visual localization, with open-source code at AVS-Mamba.

Abstract

The essence of audio-visual segmentation (AVS) lies in locating and delineating sound-emitting objects within a video stream. While Transformer-based methods have shown promise, they struggle to handle long-range dependencies because of their quadratic computational cost, which becomes a bottleneck in complex scenarios. To overcome this limitation and facilitate complex multi-modal comprehension with linear complexity, we introduce AVS-Mamba, a selective state space model to address the AVS task. Our framework incorporates two key components for video understanding and cross-modal learning: the Temporal Mamba Block for sequential video processing and the Vision-to-Audio Fusion Block for advanced audio-vision integration. Building on these, we develop the Multi-scale Temporal Encoder, which enhances the learning of visual features across scales and facilitates the perception of intra- and inter-frame information. To perform multi-modal fusion, we propose the Modality Aggregation Decoder, which leverages the Vision-to-Audio Fusion Block to integrate visual features into audio features at both the frame and temporal levels. Further, we adopt the Contextual Integration Pyramid to perform audio-to-vision spatial-temporal context collaboration. Through these contributions, our approach achieves new state-of-the-art results on the AVSBench-object and AVSBench-semantic datasets. Our source code and model weights are available at AVS-Mamba.
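
To make the linear-complexity claim concrete, here is a minimal NumPy sketch of a generic selective state space scan in the style of Mamba. It is not taken from the AVS-Mamba code: the function name `selective_scan`, the diagonal state matrix `A`, and the simplified discretization of `B` are all assumptions made for illustration. The point is only that the recurrence visits each of the `L` tokens once, so the cost grows linearly with sequence length, whereas self-attention compares every pair of tokens.

```python
import numpy as np

def selective_scan(x, delta, A, B, C):
    """Reference (sequential) selective scan; an illustrative sketch only.

    x:     (L, D) input sequence
    delta: (L, D) input-dependent step sizes
    A:     (D, N) diagonal state matrix, one row per channel (assumed layout)
    B:     (L, N) input-dependent input projection
    C:     (L, N) input-dependent output projection
    returns y: (L, D)
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))               # hidden state, fixed size regardless of L
    y = np.empty((L, D))
    for t in range(L):                 # one pass over the sequence: O(L)
        A_bar = np.exp(delta[t][:, None] * A)       # discretized transition, (D, N)
        B_bar = delta[t][:, None] * B[t][None, :]   # simplified discretized input, (D, N)
        h = A_bar * h + B_bar * x[t][:, None]       # state update
        y[t] = h @ C[t]                             # readout, (D,)
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, D, N = 8, 4, 16
    y = selective_scan(
        rng.normal(size=(L, D)),
        rng.uniform(0.1, 1.0, size=(L, D)),
        -rng.uniform(0.5, 1.5, size=(D, N)),  # negative entries keep the state stable
        rng.normal(size=(L, N)),
        rng.normal(size=(L, N)),
    )
    print(y.shape)
```

Because `delta`, `B`, and `C` vary per token, the scan can decide at each step how much of the past to keep, which is the "selective" part. Production implementations replace the Python loop with a hardware-aware parallel scan, but the total work remains linear in `L`.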

Paper Structure

This paper contains 36 sections, 19 equations, 10 figures, and 7 tables.

Figures (10)

  • Figure 1: Comparison with previous Transformer-based methods. (a) Previous methods use cross-attention in the decoder to query visual features with audio features. (b) Our AVS-Mamba serializes visual features into 1D sequences and merges them with audio features through a selective scanning mechanism.
  • Figure 2: The overall architecture of our proposed AVS-Mamba. The Multi-scale Temporal Encoder, incorporating VMamba and Temporal Mamba structures, processes spatial and temporal associations of multi-scale visual features. The Modality Aggregation Decoder utilizes the V2A Fusion Block for multi-directional serialization and integration of multi-modal features, enhancing information transfer from visual to audio modalities. Finally, the Contextual Integration Pyramid, integrating the FPN structure with refined Mamba modules, facilitates deep spatial-temporal interactions and cross-scale feature fusion.
  • Figure 3: The architecture of the Temporal Mamba Block. Initially, the video data undergoes 3D depth-wise convolution for local feature extraction, followed by modeling temporal relations using the 3D Selective Scan technique. The resulting data are multiplied by the weight parameters and passed through a residual connection, yielding visual features with enhanced temporal awareness.
  • Figure 4: The architecture of the Vision-to-Audio Fusion Block. Initially, the audio feature undergoes linear mapping followed by causal 1D convolution. Subsequently, the processed audio interacts with visual features at two levels: (i) frame-level and (ii) temporal-level, through the V2A Selective Scan Block. The resulting fused audio features are then combined with the original features via weighted multiplication and seamlessly integrated into the network using a residual connection.
  • Figure 5: The architecture of the Contextual Integration Pyramid. (a) Cross-frame audio-to-vision accumulation is performed using the Context Fusion Block, which is coupled with bilinear upsampling to align feature resolution. (b) Within each Context Fusion Block, the Temporal Mamba is employed to facilitate the exchange of temporal information across different scales of visual features. (c) We then introduce the A2V Selective Scan Block, which integrates standard audio features into the visual features through the State Space Model.
  • ...and 5 more figures
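
The captions for Figures 1(b) and 4 describe serializing visual features into 1D sequences and letting them interact with audio at a frame level and a temporal level. The NumPy sketch below shows one plausible way such sequences could be laid out. It is a reading of the captions, not the authors' implementation: the function names, the tensor shapes, and the choice to place audio tokens after the visual tokens (so that a causal scan reaches the audio position having already seen the visual content) are all assumptions.

```python
import numpy as np

def serialize_frame_level(visual, audio):
    """Hypothetical frame-level layout: one sequence per frame.

    visual: (T, H, W, C) per-frame feature maps
    audio:  (T, C) one audio feature per frame
    returns (T, H*W + 1, C): each frame's pixels followed by its audio token
    """
    T, H, W, C = visual.shape
    tokens = visual.reshape(T, H * W, C)        # flatten space row by row
    return np.concatenate([tokens, audio[:, None, :]], axis=1)

def serialize_temporal_level(visual, audio):
    """Hypothetical temporal-level layout: one sequence for the whole clip.

    returns (T*H*W + T, C): all frames' pixels in time order, then all audio tokens
    """
    T, H, W, C = visual.shape
    tokens = visual.reshape(T * H * W, C)       # flatten time and space together
    return np.concatenate([tokens, audio], axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(5, 7, 7, 32))     # T=5 frames, 7x7 map, 32 channels
    audio = rng.normal(size=(5, 32))
    print(serialize_frame_level(visual, audio).shape)
    print(serialize_temporal_level(visual, audio).shape)
```

In the frame-level layout each audio token only sees the pixels of its own frame, while in the temporal-level layout it can draw on the whole clip. Each resulting sequence could then be fed to a selective scan; the paper's actual ordering and scan directions may differ from this sketch.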