Cross-Scale MAE: A Tale of Multi-Scale Exploitation in Remote Sensing
Maofeng Tang, Andrei Cozma, Konstantinos Georgiou, Hairong Qi
TL;DR
Cross-Scale MAE addresses the multi-scale, misaligned-image problem in self-supervised learning (SSL) for remote sensing (RS) by extending MAE with scale augmentation and cross-scale constraints. It combines an encoder-side contrastive consistency loss with decoder-side cross-scale prediction and reconstruction losses, formalized as $\mathcal{L}=\mathcal{L}_{cc}+\mathcal{L}_{cp}+\mathcal{L}_{re}$, and demonstrates improved representations across diverse RS datasets and tasks. Pretraining runs efficiently on a single GPU using xFormers, and the method shows performance gains over SatMAE and Scale-MAE on both classification and segmentation benchmarks. These results suggest that explicit cross-scale information sharing, combined with generative and discriminative signals, yields practical benefits for real-world multi-scale remote sensing analysis.
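To make the combined objective concrete, the following is a minimal NumPy sketch of how three such terms could be summed. It is an illustration under stated assumptions, not the authors' implementation: the summary above does not give the exact form of each term, so an InfoNCE-style loss stands in for $\mathcal{L}_{cc}$ and mean squared error stands in for $\mathcal{L}_{cp}$ and $\mathcal{L}_{re}$. All function and argument names (`info_nce`, `cross_scale_loss`, `enc_lo`, `dec_hi`, and so on) are hypothetical.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE-style contrastive loss between two (B, D) embedding batches.

    Assumed stand-in for L_cc: row i of z_a and row i of z_b form a
    positive pair, all other rows in the batch act as negatives.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # (B, B) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def mse(a, b):
    """Mean squared error, used here as an assumed form for L_cp and L_re."""
    return float(np.mean((a - b) ** 2))

def cross_scale_loss(enc_lo, enc_hi, dec_lo, dec_hi, recon, target):
    """Return L = L_cc + L_cp + L_re and the individual terms.

    enc_lo / enc_hi: encoder embeddings of the low- and high-scale views.
    dec_lo / dec_hi: decoder features of the two views.
    recon / target:  reconstructed and ground-truth pixels.
    """
    l_cc = info_nce(enc_lo, enc_hi)   # encoder-side contrastive consistency
    l_cp = mse(dec_lo, dec_hi)        # decoder-side cross-scale prediction
    l_re = mse(recon, target)         # reconstruction
    return l_cc + l_cp + l_re, (l_cc, l_cp, l_re)
```

The terms are summed with equal weight, matching the unweighted formula above; any per-term weighting would be an additional design choice not stated here.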
Abstract
Remote sensing images present unique challenges to image analysis due to extensive geographic coverage, hardware limitations, and misaligned multi-scale images. This paper revisits the classical multi-scale representation learning problem under the general framework of self-supervised learning for remote sensing image understanding. We present Cross-Scale MAE, a self-supervised model built upon the Masked Auto-Encoder (MAE). During pre-training, Cross-Scale MAE employs scale augmentation techniques and enforces cross-scale consistency constraints through both contrastive and generative losses, yielding consistent and meaningful representations well suited to a wide range of downstream tasks. Further, our implementation leverages the xFormers library to accelerate network pre-training on a single GPU while maintaining the quality of learned representations. Experimental evaluations demonstrate that Cross-Scale MAE exhibits superior performance compared to standard MAE and other state-of-the-art remote sensing MAE methods.
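As a rough illustration of what scale augmentation can look like, the sketch below produces a lower-resolution view of an image by average pooling, giving a pair of views of the same scene at two scales. This is only one simple way to simulate a coarser ground sampling distance; the abstract does not specify the augmentation actually used, and the names `downsample` and `scale_augment` are hypothetical.

```python
import numpy as np

def downsample(image, factor):
    """Average-pool an (H, W, C) image by an integer factor.

    Rows and columns that do not fill a complete factor x factor block
    are cropped off before pooling.
    """
    h, w, c = image.shape
    h2, w2 = h // factor, w // factor
    cropped = image[:h2 * factor, :w2 * factor]
    blocks = cropped.reshape(h2, factor, w2, factor, c)
    return blocks.mean(axis=(1, 3))

def scale_augment(image, factor=2):
    """Return (high-scale view, low-scale view) of the same image."""
    return image, downsample(image, factor)
```

In a pre-training loop, the two views would be masked and encoded separately, and the cross-scale constraints would then be applied between their representations.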
