Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning

Colorado J. Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, Trevor Darrell

TL;DR

Scale-MAE tackles the problem of scale variability in remote sensing by introducing a scale-aware pretraining framework. It integrates Ground Sample Distance (GSD) based positional encoding and a progressive Laplacian-pyramid decoder into the MAE paradigm, enabling simultaneous reconstruction of low- and high-frequency information across scales. Empirically, Scale-MAE yields consistent gains in kNN classification across eight datasets and improves SpaceNet building segmentation transfer across evaluation scales, outperforming SatMAE, ConvMAE, and vanilla MAE. The work demonstrates strong multiscale transfer capability and highlights practical considerations for deploying scale-aware encoders in diverse remote sensing settings, with avenues for broader backbone compatibility and multimodal extension.

Abstract

Large, pretrained models are commonly finetuned with imagery that is heavily augmented to mimic different conditions and scales, with the resulting models used for various tasks with imagery from a range of spatial scales. Such models overlook scale-specific information in the data for scale-dependent domains, such as remote sensing. In this paper, we present Scale-MAE, a pretraining method that explicitly learns relationships between data at different, known scales throughout the pretraining process. Scale-MAE pretrains a network by masking an input image at a known input scale, where the area of the Earth covered by the image determines the scale of the ViT positional encoding, not the image resolution. Scale-MAE encodes the masked image with a standard ViT backbone, and then decodes the masked image through a bandpass filter to reconstruct low/high frequency images at lower/higher scales. We find that tasking the network with reconstructing both low/high frequency images leads to robust multiscale representations for remote sensing imagery. Scale-MAE achieves an average of a $2.4 - 5.6\%$ non-parametric kNN classification improvement across eight remote sensing datasets compared to current state-of-the-art and obtains a $0.9$ mIoU to $1.7$ mIoU improvement on the SpaceNet building segmentation transfer task for a range of evaluation scales.
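
The idea that ground area, rather than pixel count, sets the scale of the positional encoding can be illustrated with a small sketch. It assumes a standard 1-D sinusoidal encoding in which each pixel position is multiplied by the ratio of the image GSD to a reference GSD. The function name `gsd_positional_encoding`, the `reference_gsd` default of 1.0 m, and the interleaved sine/cosine layout are illustrative assumptions, not the paper's implementation.

```python
import math


def gsd_positional_encoding(num_positions, dim, gsd, reference_gsd=1.0):
    """Return a num_positions x dim table of sinusoidal encodings.

    Assumption: positions are scaled by gsd / reference_gsd so that the
    encoding depends on ground distance rather than on pixel index alone.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    scale = gsd / reference_gsd  # ground metres per pixel, relative to the reference
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim // 2):
            angle = scale * pos / (10000 ** (2 * i / dim))
            row.append(math.sin(angle))  # even index: sine
            row.append(math.cos(angle))  # odd index: cosine
        table.append(row)
    return table


# Two images with the same pixel grid but different GSDs.
fine = gsd_positional_encoding(num_positions=8, dim=4, gsd=0.5)
coarse = gsd_positional_encoding(num_positions=8, dim=4, gsd=1.0)

# Pixel 4 at 0.5 m GSD and pixel 2 at 1.0 m GSD both lie 2 m from the origin,
# so under this scaling they receive the same encoding.
same_ground_point = all(
    math.isclose(a, b, abs_tol=1e-12) for a, b in zip(fine[4], coarse[2])
)
```

Under this assumption, a finer image occupies a shorter stretch of the same sinusoid than a coarser image of equal pixel size, whereas an unscaled encoding would assign both images identical embeddings.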

Paper Structure

This paper contains 34 sections, 2 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: $\textit{Scale-MAE}$ learns better representations for multiscale tasks compared to vanilla MAE. (Column 1) The top image spans an area at 0.3m GSD and the bottom image shows the same region at a coarser GSD. (Columns 2-4) The following columns show a ground truth building segmentation, $\textit{Scale-MAE}$ segmentation from a finetuned UperNet, and segmentation from an analogously finetuned UperNet from a vanilla MAE, respectively. $\textit{Scale-MAE}$ demonstrates better performance across images at both scales. See the supplementary material for more examples.
  • Figure 2: $\textit{Scale-MAE}$ employs the Masked Autoencoder framework. An input image is patchified and masked before being passed into an MAE encoder. A Ground Sample Distance Positional Encoding (GSDPE) is added to the encoder input, which scales the positional encodings to the area of ground covered. The $\textit{Scale-MAE}$ decoder has three stages: (1) Decoding, which uses fewer transformer layers than MAE to decode the encoded values; (2) Upsampling, which progressively deconvolves the decoded feature map to a larger size before it is passed through the Laplacian Blocks (abbreviated LB; see the Scale-MAE section); and (3) Reconstruction, which reconstructs low- and high-frequency features at different scales. These outputs are used to compute an aggregate loss against ground-truth low- and high-frequency features. Following the super-resolution literature [anwar2020deep], an L1 loss is used for the high-frequency output to better reconstruct edges, and an L2 loss is used for the low-frequency output to better reconstruct average values.
  • Figure 3: Ground Sample Distance Positional Encoding (GSDPE). (Left) Input images at the same pixel resolution but different GSDs are shown. The image on the bottom is a subset of the image on the top. (Center) This overlap in location, albeit at a different resolution, is reflected in the GSDPE. The finer image with smaller spatial extent is represented by a corresponding subsection of the overall sine wave on the bottom. (Right) A standard positional encoding is strictly dependent on the image resolution and uses the same embedding for both. The colors behind the sine waves show the intensity and quantization of the encoding.
  • Figure 4: $\textit{Scale-MAE}$ reconstruction. Examples from Functional Map of the World are shown. From left to right: column 1 shows an input image at 224x224 resolution and column 2 shows its corresponding mask. Columns 3 and 4 show the low- and high-frequency outputs produced by the $\textit{Scale-MAE}$ decoder. The last column is the reconstruction obtained by summing the low- and high-frequency features.
  • Figure 5: Learning better representations at all scales. $\textit{Scale-MAE}$ (blue) features perform better than state-of-the-art. We evaluate kNN accuracy on eight datasets with a large variance in GSD. $\textit{Scale-MAE}$ consistently produces better results at coarser resolutions. In addition to using evaluation datasets at different GSDs, to further test the multiscale representations, we create multiple test sets for each dataset by downsampling the full-resolution validation set to coarser GSDs at fixed percentages: $X_{val}^{G\%}, G \in \{12.5, 25, 50, 100\}$. EuroSat omits the 12.5% set because its images are 64px wide, our patch size is 16px, and the resulting 8px input is too small.
  • ...and 3 more figures
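
The aggregate loss described in the Figure 2 caption (L1 on the high-frequency output, L2 on the low-frequency output) can be sketched in a few lines. This is a minimal illustration over flattened pixel lists. The function names and the `high_weight` parameter are hypothetical, and the equal weighting of the two terms is an assumption, since the caption does not state how they are combined.

```python
def l1_loss(pred, target):
    """Mean absolute error over two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred, target):
    """Mean squared error over two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def aggregate_loss(pred_low, target_low, pred_high, target_high, high_weight=1.0):
    """L2 on the low-frequency output plus weighted L1 on the high-frequency output.

    Assumption: the two terms are simply added; high_weight is hypothetical.
    """
    low_term = l2_loss(pred_low, target_low)      # penalizes errors in average values
    high_term = l1_loss(pred_high, target_high)   # penalizes errors around edges
    return low_term + high_weight * high_term


# Toy example with two "pixels" per frequency band.
loss = aggregate_loss(
    pred_low=[0.5, 0.5], target_low=[0.0, 1.0],
    pred_high=[0.1, -0.1], target_high=[0.0, 0.0],
)
```

In the toy example the low-frequency term is the mean of two squared errors of 0.25 each, and the high-frequency term is the mean of two absolute errors of 0.1 each.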
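
The test-set construction in the Figure 5 caption reduces to simple arithmetic, shown here as a short sketch. The helper `downsampled_sizes` is hypothetical, and the rule that a downsampled image must be at least one patch wide is an assumption inferred from the caption's EuroSat example; the 224px case is likewise only an illustration.

```python
def downsampled_sizes(full_size_px, patch_size_px=16, percentages=(12.5, 25, 50, 100)):
    """Map each percentage G to the downsampled image width in pixels.

    Assumption: a test set is kept only if the downsampled image is at
    least one patch wide.
    """
    sizes = {}
    for pct in percentages:
        size = int(full_size_px * pct / 100)
        if size >= patch_size_px:
            sizes[pct] = size
    return sizes


eurosat = downsampled_sizes(64)    # 64px images: the 12.5% set would be 8px, below the 16px patch
example = downsampled_sizes(224)   # illustrative 224px images keep all four sets
```

For 64px images this keeps the 25%, 50%, and 100% sets (16px, 32px, and 64px) and drops the 12.5% set, matching the exclusion stated in the caption.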