M$^3$amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification

Mingxiang Cao, Weiying Xie, Xin Zhang, Jiaqing Zhang, Kai Jiang, Jie Lei, Yunsong Li

TL;DR

M$^3$amba is a novel end-to-end CLIP-driven Mamba model for multi-modal fusion. It introduces CLIP-driven modality-specific adapters into the fusion architecture to avoid the bias in understanding specific domains that direct inference causes, making the original CLIP encoder's modality-specific perception more accurate.

Abstract

Multi-modal fusion holds great promise for integrating information from different modalities. However, because they give little consideration to modal consistency, existing multi-modal fusion methods in remote sensing still suffer from incomplete semantic information and low computational efficiency in their fusion designs. Inspired by the observation that the vision-language pre-training model CLIP can effectively extract strong semantic information from visual features, we propose M$^3$amba, a novel end-to-end CLIP-driven Mamba model for multi-modal fusion, to address these challenges. Specifically, we introduce CLIP-driven modality-specific adapters into the fusion architecture to avoid the bias in understanding specific domains that direct inference causes, giving the original CLIP encoder modality-specific perception. This unified framework needs only minimal training to achieve a comprehensive semantic understanding of different modalities, which in turn guides cross-modal feature fusion. To further strengthen the consistent association between modality mappings, we design a multi-modal Mamba fusion architecture with linear complexity and a cross-attention module, Cross-SS2D, which together provide effective and efficient information interaction to achieve complete fusion. Extensive experiments show that M$^3$amba improves average performance by at least 5.98\% over state-of-the-art methods on multi-modal hyperspectral image classification tasks in remote sensing, while also demonstrating excellent training efficiency, so that both accuracy and efficiency improve. The code is released at https://github.com/kaka-Cao/M3amba.
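
To give a feel for what "linear complexity" means for a Mamba-style model, the sketch below runs a scalar state-space recurrence over a token sequence in a single pass. It is only an illustration under simplifying assumptions: the function name `linear_scan` and the fixed scalars `a`, `b`, `c` are hypothetical, and this is not the paper's Cross-SS2D module or its selective (input-dependent) parameterization.

```python
from typing import List


def linear_scan(x: List[float], a: float = 0.9, b: float = 0.1, c: float = 1.0) -> List[float]:
    """Toy state-space recurrence (hypothetical, fixed parameters).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    Each token is visited exactly once, so the cost grows linearly with
    the sequence length, unlike pairwise attention, which compares every
    token with every other token.
    """
    h = 0.0  # hidden state carried across the sequence
    y = []
    for x_t in x:
        h = a * h + b * x_t  # update the state with the current token
        y.append(c * h)      # read out the output for this position
    return y


if __name__ == "__main__":
    # An impulse followed by zeros: the state decays by the factor `a` per step.
    print(linear_scan([1.0, 0.0, 0.0], a=0.5, b=1.0))
```

With these toy parameters, the output for an impulse input is a geometrically decaying sequence, which shows how the state summarizes earlier tokens without revisiting them.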

Paper Structure

This paper contains 21 sections, 10 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: Left: Proposed CLIP-driven modality-specific adapters guide the fusion process to produce complete features through semantic information interactions. Right: Common methods perform pairwise fusion by training encoders and fusion networks, which lacks consideration of semantic consistency and leads to incomplete representation. $I_1$ and $I_2$ represent inputs from different modalities.
  • Figure 2: Overview of the M$^{3}$amba framework. For clarity, we split the end-to-end process of training into three stages: Feature Adaptation, Mamba Fusion with Cross-SS2D, and Training Objective. By utilizing the M$^{3}$amba framework to perform a feature-level fusion of the two modalities, we can apply the complete fusion features to different downstream tasks. MLP Head consists of several convolutional and linear layers, and CE stands for Cross Entropy.
  • Figure 3: Comparison of box plots with other methods on three datasets.
  • Figure 4: t-SNE for ablation on three datasets. The results from top to bottom correspond to the Houston2013, Augsburg, and MUUFL datasets, respectively.
  • Figure 5: Visualization of false-color HSI and LiDAR images using different comparison methods based on the Houston2013 dataset. H and L respectively indicate that our method is trained using only HSI or LiDAR data.
  • ...and 2 more figures