RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing

Fengxiang Wang, Yulin Wang, Mingshuo Chen, Haiyan Zhao, Yangang Sun, Shuo Wang, Hongzhen Wang, Di Wang, Long Lan, Wenjing Yang, Jing Zhang

TL;DR

RoMA tackles the scalability bottleneck of ViT-based RS models by introducing rotation-aware multi-scale autoregressive pretraining for Mamba backbones on large unlabeled RS data. It fuses adaptive rotation encoding with angular embeddings and a multi-scale token-prediction objective, leveraging a KV-cache–based full-image encoder to enable efficient next-token prediction. Across scene classification, change detection, and semantic segmentation, RoMA-pretrained Mamba backbones achieve higher accuracy and lower computational cost than ViT-based RSFMs, with substantial memory and speed gains at high resolutions. The work establishes scaling behavior for Mamba-based RSFMs in remote sensing and releases code and pretrained models to facilitate broader adoption.

Abstract

Recent advances in self-supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain-specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self-supervised pretraining of Mamba-based RS foundation models using large-scale, diverse, unlabeled data. RoMA enhances scalability for high-resolution images through a tailored auto-regressive learning strategy, incorporating two key innovations: 1) a rotation-aware pretraining mechanism combining adaptive cropping with angular embeddings to handle sparsely distributed objects with arbitrary orientations, and 2) multi-scale token prediction objectives that address the extreme variations in object scales inherent to RS imagery. Systematic empirical studies validate that Mamba adheres to RS data and parameter scaling laws, with performance scaling reliably as model and data size increase. Furthermore, experiments across scene classification, object detection, and semantic segmentation tasks demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency. The source code and pretrained models will be released at https://github.com/MiliLab/RoMA.
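
The autoregressive objective described above can be illustrated with a small NumPy sketch. It is not the RoMA implementation: the helper names (`next_token_pairs`, `pool_tokens`, `multi_scale_mse`), the token dimensions, and the choice to form coarser scales by average-pooling consecutive tokens are all assumptions made for illustration. The sketch only shows how next-token targets are a shifted copy of the patch sequence and how a loss can be summed over more than one scale.

```python
import numpy as np

def next_token_pairs(tokens):
    """Return (inputs, targets) where targets[i] is the token following inputs[i]."""
    return tokens[:-1], tokens[1:]

def pool_tokens(tokens, group):
    """Average-pool consecutive tokens in groups of `group`, dropping any remainder.

    Assumption: pooling along the sequence stands in for a coarser spatial scale.
    """
    n = (tokens.shape[0] // group) * group
    return tokens[:n].reshape(-1, group, tokens.shape[1]).mean(axis=1)

def multi_scale_mse(pred, target, groups=(1, 4)):
    """Sum of mean-squared errors between pred and target at each pooling scale."""
    total = 0.0
    for g in groups:
        diff = pool_tokens(pred, g) - pool_tokens(target, g)
        total += float(np.mean(diff ** 2))
    return total

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))      # hypothetical: 16 patch tokens of dimension 8
inputs, targets = next_token_pairs(tokens)

# Stand-in for the encoder and decoder: a trivial predictor that copies the current token.
pred = inputs
loss = multi_scale_mse(pred, targets)
print(inputs.shape, targets.shape, loss > 0.0)
```

In a real pipeline `pred` would come from the model's output at each position, and the scales and their weights would follow the paper's definitions, not the fixed `groups=(1, 4)` used here.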

Paper Structure

This paper contains 13 sections, 2 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Comparison of ViT (pretrained with MAE) and our Mamba model (pretrained with RoMA) in scene classification, change detection, and semantic segmentation. Mamba outperforms ViT while being more computationally and memory efficient for high-resolution images. Notably, Mamba-B achieves 1.56$\times$ faster inference and reduces GPU memory usage by 78.9% on 1248$\times$1248 resolution images (6084 tokens per image) on a single NVIDIA 4090 GPU (batch size = 2).
  • Figure 2: Comparison between our autoregressive pretraining strategy and the standard MAE method. (1) RoMA encodes all patches using a Mamba encoder, whereas MAE encodes only a randomly sampled subset. (2) RoMA predicts the next token in a sequence to capture continuity, while MAE only reconstructs masked patches.
  • Figure 3: Overview of the RoMA Pretraining Pipeline. The input image is first divided into patches, and high-value patches are selected for random rotation using the Adaptive Rotation Encoding Strategy. These patches are then tokenized and processed by the Mamba encoder. The encoded features undergo autoregressive next-token prediction, followed by a multi-scale strategy that computes loss at different scales for gradient updates.
  • Figure 4: Illustration of the Adaptive Rotation Encoding Strategy. (a) Pipeline of the Adaptive Rotation Encoding Strategy. LBP refers to Local Binary Pattern. (b) Random patch selection for rotation without adaptive selection. The random approach in (b) disrupts object information in the RS image.
  • Figure 5: Scaling with Data Volume and Model Size. Each experiment was conducted three times, and the average was reported as the final result. (a) We showcase the Mamba-Base model's performance on three downstream tasks after RoMA pretraining with different data scales. (b) We compare the performance of various Mamba model sizes on three downstream tasks, all pretrained with 4 million samples using RoMA. Details on pretraining and downstream task configurations are provided in Section 5.1.
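
The token count quoted in the Figure 1 caption can be checked with a short calculation. The patch size of 16 is an assumption (it is not stated in the caption, but it is consistent with 6084 tokens for a 1248$\times$1248 image), and `num_tokens` is a hypothetical helper. The second number shows why quadratic self-attention becomes costly at this resolution: one attention score matrix has one entry per token pair.

```python
def num_tokens(height, width, patch=16):
    """Number of non-overlapping patch tokens for an image (patch size is assumed)."""
    return (height // patch) * (width // patch)

n = num_tokens(1248, 1248)   # 78 patches per side
pairs = n * n                # entries in a single self-attention score matrix
print(n, pairs)              # prints: 6084 37015056
```

A linear-complexity sequence model scales with `n` instead of `pairs`, which is the motivation the abstract gives for replacing ViT with Mamba on high-resolution inputs.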