NeighborMAE: Exploiting Spatial Dependencies between Neighboring Earth Observation Images in Masked Autoencoders Pretraining

Liang Zeng, Valerio Marsocci, Wufan Zhao, Andrea Nascetti, Maarten Vergauwen

TL;DR

Experimental results across various pretraining datasets and downstream tasks show that NeighborMAE significantly outperforms existing baselines, underscoring the value of neighboring images in Masked Image Modeling for Earth Observation and the efficacy of the designs.

Abstract

Masked Image Modeling has been one of the most popular self-supervised learning paradigms to learn representations from large-scale, unlabeled Earth Observation images. While incorporating multi-modal and multi-temporal Earth Observation data into Masked Image Modeling has been widely explored, the spatial dependencies between images captured from neighboring areas remain largely overlooked. Since the Earth's surface is continuous, neighboring images are highly related and offer rich contextual information for self-supervised learning. To close this gap, we propose NeighborMAE, which learns spatial dependencies by joint reconstruction of neighboring Earth Observation images. To ensure that the reconstruction remains challenging, we leverage a heuristic strategy to dynamically adjust the mask ratio and the pixel-level loss weight. Experimental results across various pretraining datasets and downstream tasks show that NeighborMAE significantly outperforms existing baselines, underscoring the value of neighboring images in Masked Image Modeling for Earth Observation and the efficacy of our designs.

Paper Structure (37 sections, 8 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: NeighborMAE jointly reconstructs masked regions across neighboring EO images by self-attention over all tokens. This design enables the model to capture spatial and other inherent dependencies (e.g., temporal) between neighboring observations.
  • Figure 2: The overview of NeighborMAE. We sample pairs of neighboring images from datasets based on geographic coordinates. Their relative positions are embedded in a shared coordinate system, and the mask ratio is chosen based on IoU. Neighboring images are jointly reconstructed by MAE, which learns the in-between spatial dependencies by self-attention. The reconstruction loss is weighted by the visibility from the masked input of neighboring images to avoid learning shortcuts.
  • Figure 3: Examples of different ways to extend the view of a base input (a) using a similar number of additional patches. Neighbors introduce diverse spatial variations for learning.
  • Figure 4: fMoW classification performance after pretraining on fMoW for different numbers of epochs with NeighborMAE, MAE, and SatMAE++. NeighborMAE consistently obtains the best performance.
  • Figure 5: Visualization of the reconstruction of neighboring images from fMoW-RGB. From left to right, we show pairs of neighboring images, masked images, prediction, cross-visible pixels, and the loss weight. Neighboring images from fMoW-RGB usually exhibit significant temporal changes and therefore our loss weighting by cross visibility has less impact.
  • ...and 4 more figures
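
The Figure 2 caption says the mask ratio is chosen based on the IoU of a neighboring image pair. The sketch below illustrates that idea under stated assumptions: footprints are axis-aligned boxes in a shared coordinate system, and the mapping from IoU to mask ratio is a simple linear interpolation. The function names, the linear rule, and the default ratios (0.75 and 0.9) are hypothetical and are not taken from the paper, which only states that a heuristic strategy is used.

```python
from typing import Tuple

# Footprint of an image as (min_x, min_y, max_x, max_y) in a shared
# coordinate system. Axis-aligned boxes are an assumption of this sketch.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_ratio_from_iou(
    iou: float, base_ratio: float = 0.75, max_ratio: float = 0.9
) -> float:
    """Hypothetical heuristic: mask more aggressively as overlap grows.

    Higher overlap means more content of one image is visible in the
    other, so a larger mask ratio keeps reconstruction challenging.
    The linear form and the default values are illustrative only.
    """
    iou = min(max(iou, 0.0), 1.0)  # clamp to [0, 1]
    return base_ratio + (max_ratio - base_ratio) * iou


if __name__ == "__main__":
    left = (0.0, 0.0, 2.0, 2.0)
    right = (1.0, 0.0, 3.0, 2.0)  # shifted one unit east, half overlapping
    iou = box_iou(left, right)
    print(f"IoU = {iou:.3f}, mask ratio = {mask_ratio_from_iou(iou):.3f}")
```

The monotone direction (more overlap, more masking) follows from the abstract's stated goal of keeping reconstruction challenging; the actual schedule used by NeighborMAE may differ.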