Cross-Scale MAE: A Tale of Multi-Scale Exploitation in Remote Sensing

Maofeng Tang, Andrei Cozma, Konstantinos Georgiou, Hairong Qi

TL;DR

Cross-Scale MAE addresses the multi-scale, misaligned-image problem in self-supervised learning (SSL) for remote sensing (RS) by extending MAE with scale augmentation and cross-scale constraints. It combines an encoder-side contrastive consistency loss with decoder-side cross-scale prediction and reconstruction losses, formalized as $\mathcal{L}=\mathcal{L}_{cc}+\mathcal{L}_{cp}+\mathcal{L}_{re}$, and reports improved representations across diverse RS datasets and tasks. Pretraining runs efficiently on a single GPU using xFormers, and the method shows performance gains over SatMAE and Scale-MAE on both classification and segmentation benchmarks. These results suggest that explicit cross-scale information sharing, combined with generative and discriminative signals, yields practical benefits for real-world multi-scale remote sensing analysis.

Abstract

Remote sensing images present unique challenges to image analysis due to extensive geographic coverage, hardware limitations, and misaligned multi-scale images. This paper revisits the classical multi-scale representation learning problem under the general framework of self-supervised learning for remote sensing image understanding. We present Cross-Scale MAE, a self-supervised model built upon the Masked Auto-Encoder (MAE). During pre-training, Cross-Scale MAE employs scale augmentation techniques and enforces cross-scale consistency constraints through both contrastive and generative losses to ensure consistent and meaningful representations well-suited for a wide range of downstream tasks. Further, our implementation leverages the xFormers library to accelerate network pre-training on a single GPU while maintaining the quality of learned representations. Experimental evaluations demonstrate that Cross-Scale MAE exhibits superior performance compared to standard MAE and other state-of-the-art remote sensing MAE methods.
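To make the three-term objective $\mathcal{L}=\mathcal{L}_{cc}+\mathcal{L}_{cp}+\mathcal{L}_{re}$ concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes an InfoNCE-style form for the contrastive consistency term (this page does not give the exact formulation), a plain MSE between decoder features of the two scales for the cross-prediction term, and an MAE-style MSE over masked patches for reconstruction. The names `info_nce`, `cross_scale_loss`, and the `temperature` value are hypothetical.

```python
import numpy as np


def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive consistency between two (B, D) embedding batches.

    InfoNCE is an assumed stand-in; row i of z_a and row i of z_b are
    treated as the positive pair (same image at two scales).
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                     # (B, B) scaled cosine similarity
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))              # positives lie on the diagonal


def mse(a, b):
    """Mean squared error over all elements."""
    return float(np.mean((a - b) ** 2))


def cross_scale_loss(enc_lo, enc_hi, dec_lo, dec_hi, pred, target, mask):
    """Sum of the three terms: L = L_cc + L_cp + L_re.

    enc_*:  (B, D)    encoder embeddings of the two scale views
    dec_*:  (B, N, D) decoder features of the two scale views
    pred, target: (B, N, P) predicted and true pixel patches
    mask:   (B, N)    1 where a patch was masked, 0 where it was visible
    """
    l_cc = info_nce(enc_lo, enc_hi)
    l_cp = mse(dec_lo, dec_hi)
    per_patch = ((pred - target) ** 2).mean(axis=-1)       # (B, N)
    l_re = float((per_patch * mask).sum() / mask.sum())    # masked patches only
    return l_cc + l_cp + l_re, (l_cc, l_cp, l_re)
```

The terms are summed with equal weight, as in the formula above; any per-term weighting would be an additional assumption.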

Paper Structure (12 sections, 6 equations, 3 figures, 7 tables)

Figures (3)

  • Figure 1: The architecture of Cross-Scale MAE comprises an encoder and a decoder (Left). The encoder (Top-Right) employs a vision transformer (ViT) backbone, specifically a ViT-base with 12 self-attention blocks. The decoder (Bottom-Right) uses a lightweight ViT backbone with 8 self-attention blocks. A single satellite image undergoes scale augmentation through random cropping and resizing. The contrastive loss is computed using encoder outputs for the two scale inputs. The cross-prediction loss (an MSE) is applied to the output of the decoder's last self-attention block. The reconstruction loss compares the predicted masked patches with the actual input.
  • Figure 2: The KNN classification of Cross-Scale MAE for different datasets.
  • Figure 3: The comparison of multi-scale augmentation and GSD positional encoding.
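
The scale augmentation described in the Figure 1 caption (random cropping and resizing of a single image) could be sketched as follows. This is an illustrative NumPy version under stated assumptions: the two views are the full image and a random crop, both resized to the same square size with nearest-neighbour sampling, so that they show overlapping content at different effective resolutions. The function names, the `scale_range` default, and the choice of interpolation are hypothetical and are not taken from the paper.

```python
import numpy as np


def resize_nearest(img, out_size):
    """Nearest-neighbour resize of an (H, W, C) array to (out_size, out_size, C)."""
    h, w = img.shape[:2]
    rows = np.arange(out_size) * h // out_size   # source row for each output row
    cols = np.arange(out_size) * w // out_size   # source column for each output column
    return img[rows[:, None], cols[None, :]]


def scale_augment(img, out_size=224, scale_range=(0.2, 0.8), rng=None):
    """Return two views of one image at different effective scales.

    View 1: the whole image resized to out_size (coarser scale).
    View 2: a random crop covering a fraction of each side, resized to
            out_size (finer scale). The fraction range is an assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    frac = rng.uniform(*scale_range)
    ch, cw = max(1, int(h * frac)), max(1, int(w * frac))
    top = rng.integers(0, h - ch + 1)            # upper bound is exclusive
    left = rng.integers(0, w - cw + 1)
    crop = img[top:top + ch, left:left + cw]
    return resize_nearest(img, out_size), resize_nearest(crop, out_size)
```

Both views come out with the same shape, so they can be fed to a shared encoder; a real pipeline would more likely use bilinear or bicubic interpolation from an image library.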