SatMamba: Development of Foundation Models for Remote Sensing Imagery Using State Space Models

Chuc Man Duc, Hiromichi Fukui

TL;DR

SatMamba is a pretraining framework that combines masked autoencoding with a Mamba backbone built on state space models to achieve linear computational scaling for remote sensing foundation models, addressing the quadratic cost of Vision Transformers on long multispectral sequences. Through SatMamba-B variants and ablations, the work demonstrates pretraining performance competitive with ViT-based MAE methods and strong fine-tuning results on high-resolution semantic segmentation and building-damage assessment tasks. Key findings show that multi-directional Mamba scanning and the choice of positional encoding influence performance, with SatMamba-B w/o pos achieving the top quantitative metrics in several experiments, albeit with higher initial computation and memory demands. Overall, SatMamba broadens the toolkit for Earth observation foundation models by enabling scalable image-to-image pretraining and future expansion to multispectral and multitemporal data.

Abstract

Foundation models refer to deep learning models pretrained on large unlabeled datasets through self-supervised algorithms. In the Earth science and remote sensing communities, there is growing interest in transforming the use of Earth observation data, including satellite and aerial imagery, through foundation models. Various foundation models have been developed for remote sensing, such as those for multispectral, high-resolution, and hyperspectral images, and have demonstrated superior performance on various downstream tasks compared to traditional supervised models. These models are evolving rapidly, with capabilities to handle multispectral, multitemporal, and multisensor data. Most studies use masked autoencoders in combination with Vision Transformers (ViTs) as the backbone for pretraining. While these models have shown promising performance, ViTs face challenges, such as quadratic computational scaling with input length, which may limit performance on multiband and multitemporal data with long sequences. This research aims to address these challenges by proposing SatMamba, a new pretraining framework that combines masked autoencoders with State Space Models, offering linear computational scaling. Experiments on high-resolution imagery across various downstream tasks show promising results, paving the way for more efficient foundation models and unlocking the full potential of Earth observation data. The source code is available at https://github.com/mdchuc/HRSFM.
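
The scaling argument in the abstract can be made concrete with a rough count. The sketch below is illustrative only and is not taken from the SatMamba code: the patch size of 16, the input sizes, and all function names are assumptions, and the counts ignore constant factors such as embedding width and state dimension. It shows that doubling the image side quadruples the number of patch tokens, which multiplies the pairwise interactions of full self-attention by 16 but the steps of a sequential scan only by 4.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Sequence length when a square image is cut into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_pair_count(n: int) -> int:
    """Token-pair interactions in full self-attention: grows as n squared."""
    return n * n


def scan_step_count(n: int) -> int:
    """State updates in one sequential scan over the tokens: grows linearly in n."""
    return n


if __name__ == "__main__":
    # Hypothetical input sizes and patch size, chosen only for illustration.
    for size in (224, 448, 896):
        n = num_tokens(size, patch_size=16)
        print(f"{size}px: tokens={n}, attention pairs={attention_pair_count(n)}, "
              f"scan steps={scan_step_count(n)}")
```

The same reasoning applies when extra bands or time steps are tokenized separately, since each one lengthens the sequence further.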

Paper Structure

This paper contains 20 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Details of the proposed SatMamba architecture. The details of the encoder are shown in the upper part of the figure. The decoder architecture is essentially similar to the encoder but smaller.
  • Figure 2: Pretraining results of SatMamba-B and ViTMAE-B on the fMoW dataset: (a) Ablation experiments with different scanning directions over 100 epochs; (b) Full pretraining results of SatMamba-B using four scanning directions over 800 epochs.
  • Figure 3: Resource requirements of different models across varying input sizes: (a) Computational requirements; (b) Memory requirements.
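
Figure 2 refers to ablations over scanning directions. As a minimal sketch of what scanning a patch grid in several directions can mean, the code below enumerates four traversal orders over a grid of patch indices. The specific four orders (row-major and column-major, each forward and reversed) and the function name `scan_orders` are assumptions made for illustration; the listing here does not specify which directions SatMamba uses.

```python
def scan_orders(height: int, width: int) -> dict[str, list[int]]:
    """Return four hypothetical traversal orders over a height x width patch grid.

    Patches are numbered row by row, so the patch at (row r, column c)
    has index r * width + c.
    """
    row_major = [r * width + c for r in range(height) for c in range(width)]
    col_major = [r * width + c for c in range(width) for r in range(height)]
    return {
        "row_forward": row_major,          # left to right, top to bottom
        "row_backward": row_major[::-1],   # the same path, reversed
        "col_forward": col_major,          # top to bottom, left to right
        "col_backward": col_major[::-1],   # the same path, reversed
    }


if __name__ == "__main__":
    # A tiny 2 x 3 grid makes the four orders easy to compare by eye.
    for name, order in scan_orders(2, 3).items():
        print(name, order)
```

Each order is a permutation of the same token indices, so a sequence model can process the reordered tokens and the outputs can be mapped back to grid positions by inverting the permutation.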