A$^{2}$-MAE: A spatial-temporal-spectral unified remote sensing pre-training method based on anchor-aware masked autoencoder

Lixian Zhang; Yi Zhao; Runmin Dong; Jinxiao Zhang; Shuai Yuan; Shilei Cao; Mengxuan Chen; Juepeng Zheng; Weijia Li; Wei Liu; Wayne Zhang; Litong Feng; Haohuan Fu

TL;DR

This work tackles the fragmentation of remote sensing pre-training by introducing STSSD, a large-scale, multi-source dataset, and A$^{2}$-MAE, a unified masked autoencoder that combines anchor-aware masking with geographic encoding. By integrating spatial, temporal, and spectral information with geographic priors in a single backbone, the method learns robust representations across diverse RS sources. Empirical results on classification, segmentation, and change detection show consistent improvements over strong RS SSL baselines, while maintaining efficiency with a compact parameter footprint. The approach offers a practical path toward scalable, geography-aware RS foundation models and sets the stage for incorporating additional modalities such as SAR and hyperspectral data in the future.

Abstract

Vast amounts of remote sensing (RS) data provide Earth observations across multiple dimensions, encompassing critical spatial, temporal, and spectral information that is essential for addressing global-scale challenges such as land use monitoring, disaster prevention, and environmental change mitigation. Despite various pre-training methods tailored to the characteristics of RS data, a key limitation persists: the inability to effectively integrate spatial, temporal, and spectral information within a single unified model. To unlock the potential of RS data, we construct a Spatial-Temporal-Spectral Structured Dataset (STSSD) characterized by the incorporation of multiple RS sources, diverse coverage, unified locations within image sets, and heterogeneity within images. Building upon this structured dataset, we propose an Anchor-Aware Masked AutoEncoder method (A$^{2}$-MAE), which leverages intrinsic complementary information from the different kinds of images and geo-information to reconstruct the masked patches during the pre-training phase. A$^{2}$-MAE integrates an anchor-aware masking strategy and a geographic encoding module to comprehensively exploit the properties of RS images. Specifically, the proposed anchor-aware masking strategy dynamically adapts the masking process based on the meta-information of a pre-selected anchor image, thereby facilitating training on images captured by diverse types of RS sources within one model. Furthermore, we propose a geographic encoding method to leverage accurate spatial patterns, enhancing the model's generalization capabilities for downstream applications that are generally location-related. Extensive experiments demonstrate that our method achieves comprehensive improvements across various downstream tasks, including image classification, semantic segmentation, and change detection, compared with existing RS pre-training methods.

Paper Structure (19 sections, 2 equations, 4 figures, 2 tables)

Introduction
Related Work
Large-scale datasets for remote sensing imagery pre-training
Self-supervised learning for satellite imagery
Geography-aware learning
Data
Overview
STSSD Construction
Methodology
Overall Architecture
Setup
Anchor-Aware Masking Strategy
Geographic Encoding
Experiments
Implementation Details and Baselines
...and 4 more sections

Figures (4)

Figure 1: An illustrative overview of the proposed global-scale dataset STSSD and pre-training method A$^{2}$-MAE. STSSD is a comprehensive remote sensing dataset structured and characterized by the inclusion of diverse spatial, temporal, and spectral coverage. A$^{2}$-MAE facilitates the efficient utilization of the intrinsic complementary information in STSSD within one unified spatial-temporal-spectral model. Lon., Lat., and GSD indicate longitude, latitude, and ground sampling distance information, respectively.
Figure 2: The compositions of image sets and the global sampling location distribution. (a) shows the S2-L8 image sets. "ng" denotes the non-growth period, and "g" denotes the growth period within one year. (b) shows the GF-S2 image sets. (c) is the sampling location distribution (i.e., purple circles for S2-L8 in urban areas, green circles for S2-L8 in nature reserves, and pink circles for GF-S2).
Figure 3: The overall framework of A$^{2}$-MAE. A$^{2}$-MAE incorporates an anchor-aware masking strategy and a geographic encoding module, allowing for the efficient utilization of spatial-, temporal-, and spectral-variant information in large-scale RS imagery.
Figure 4: The proposed geographic encoding module in A$^{2}$-MAE. By efficiently encoding the geographic metadata of RS images (i.e., one $Geo_{GSD}$ value and four sets of $(Geo_{Lat}^{c}, Geo_{Lon}^{c})$), the geographic encoding module encourages A$^{2}$-MAE to be explicitly aware of this crucial geographic prior for varying geographic information in images.
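
The Figure 4 caption names the inputs of the geographic encoding module (one GSD value and four latitude/longitude pairs) but this page does not give its formulation. The following is only a minimal sketch of one plausible way to turn that metadata into a feature vector, using a Transformer-style sinusoidal encoding. The function names `sincos_encode` and `encode_geo_metadata`, the `dim` and `max_period` parameters, the choice of metres for GSD, and the reading of the four pairs as image corners are all assumptions made for illustration, not the authors' implementation.

```python
import math


def sincos_encode(value, dim, max_period=10000.0):
    """Sinusoidal encoding of one scalar into `dim` features.

    Hypothetical helper: the frequency schedule follows the usual
    Transformer positional encoding, not necessarily the paper's.
    """
    if dim <= 0 or dim % 2 != 0:
        raise ValueError("dim must be a positive even integer")
    half = dim // 2
    # Geometrically spaced frequencies from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    sines = [math.sin(value * f) for f in freqs]
    cosines = [math.cos(value * f) for f in freqs]
    return sines + cosines


def encode_geo_metadata(gsd, corners, dim=16):
    """Concatenate encodings of one GSD value and four (lat, lon) pairs.

    gsd: ground sampling distance (assumed here to be in metres).
    corners: four (lat, lon) pairs in degrees (assumed to be image corners).
    Returns a flat list of 9 * dim floats (1 GSD + 4 lat + 4 lon scalars).
    """
    corners = list(corners)
    if len(corners) != 4:
        raise ValueError("expected four (lat, lon) pairs")
    features = sincos_encode(gsd, dim)
    for lat, lon in corners:
        features += sincos_encode(lat, dim)
        features += sincos_encode(lon, dim)
    return features


# Example with made-up coordinates and a 10 m GSD.
corners = [(39.9, 116.3), (39.9, 116.4), (40.0, 116.3), (40.0, 116.4)]
vec = encode_geo_metadata(10.0, corners, dim=16)
print(len(vec))  # 9 scalars x 16 features = 144
```

In a masked autoencoder, a vector like this would typically be projected to the token dimension and added to, or concatenated with, the patch embeddings; how A$^{2}$-MAE actually injects it is described in the paper's Geographic Encoding section rather than here.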
