USat: A Unified Self-Supervised Encoder for Multi-Sensor Satellite Imagery

Jeremy Irvin, Lucas Tao, Joanne Zhou, Yuntao Ma, Langston Nashold, Benjamin Liu, Andrew Y. Ng

TL;DR

USat introduces a unified self-supervised encoder designed for multi-sensor remote sensing data with heterogeneous spectral bands and varying ground sampling distances. The USat encoder uses per-band patch projections, spectral-group pooling, and a combination of superpositional, spectral-group, and sensor encodings to maintain geospatial alignment across sensors, enabling robust MAE-style pretraining (USatMAE). Empirical results on USatlas show that multi-sensor pretraining generally outperforms single-sensor baselines across EuroSAT, BigEarthNet, and METER-ML, with pronounced benefits in low-data regimes. The work demonstrates competitive performance against ImageNet pretraining and highlights practical implications for multi-sensor remote sensing, including improved transferability and flexibility in spectral-band usage.

Abstract

Large, self-supervised vision models have led to substantial advancements for automatically interpreting natural images. Recent works have begun tailoring these methods to remote sensing data, which has rich structure with multi-sensor, multi-spectral, and temporal information, providing massive amounts of self-labeled data that can be used for self-supervised pre-training. In this work, we develop a new encoder architecture called USat that can input multi-spectral data from multiple sensors for self-supervised pre-training. USat is a vision transformer with modified patch projection layers and positional encodings to model spectral bands with varying spatial scales from multiple sensors. We integrate USat into a Masked Autoencoder (MAE) self-supervised pre-training procedure and find that a pre-trained USat outperforms state-of-the-art self-supervised MAE models trained on remote sensing data on multiple remote sensing benchmark datasets (up to 8%) and leads to improvements in low data regimes (up to 7%). Code and pre-trained weights are available at https://github.com/stanfordmlgroup/USat.

Paper Structure (28 sections, 2 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Overview of the USat encoder architecture. USat can accept any subset of spectral bands (channels) and spatial patches from multiple satellite (Sentinel-2) and aerial imagery (NAIP) sensors. Each spectral band is independently patchified, with lower GSD (higher spatial resolution) bands divided into more patches than higher GSD (lower spatial resolution) bands. We embed each band with a separate patch projection layer whose outputs are then input to the spectral group pooling layer, which combines corresponding patches from different bands to produce the per-patch embeddings. Each patch embedding is then summed with an encoding vector that captures the positional and spectral group information of each patch, where superpositional encodings are used for the higher GSD bands indicated with *, before being fed into a Transformer. Best viewed in color.
  • Figure 2: Test set performance of USatMAE against baseline models across varying training set sizes of downstream datasets. EuroSAT is evaluated using accuracy (Acc), BigEarthNet micro-average precision (mAP), and METER-ML macro-average precision (MAP).
  • Figure 3: Example superpositional encodings used in USatMAE. Images from different sensors capture identical ground areas, but GSD can differ between sensors, so we use superpositional encodings to capture the positional relationships. Specifically, for higher GSD (lower spatial resolution) sensors, we use the mean of the lower GSD (higher spatial resolution) positional encodings.
  • Figure 4: Visualization of the cosine similarities between positional encodings of an image with 16x16 patches and the superpositional encodings of an image with 8x8 patches.
  • Figure 5: Class counts of the USatlas training and validation sets.
  • ...and 2 more figures
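
The superpositional encodings described in Figures 3 and 4 can be illustrated with a minimal NumPy sketch. It is not taken from the USat repository: the function names (sincos_2d, superpositional, cosine_similarity), the embedding dimension of 64, and the choice of fixed 2D sine-cosine encodings for the fine grid are assumptions made for illustration. The only part taken from the captions is that the encoding of a coarse (higher GSD) patch is the mean of the encodings of the fine (lower GSD) patches covering the same ground area.

```python
import numpy as np


def sincos_2d(grid_size: int, dim: int) -> np.ndarray:
    """Fixed 2D sine-cosine encodings, shape (grid_size, grid_size, dim).

    Assumption: the fine-grid encodings are MAE-style sin-cos encodings.
    """
    assert dim % 4 == 0, "dim must be divisible by 4"
    omega = 1.0 / (10000 ** (np.arange(dim // 4) / (dim // 4)))
    pos = np.arange(grid_size, dtype=np.float64)
    angles = pos[:, None] * omega[None, :]                  # (grid, dim/4)
    enc_1d = np.concatenate([np.sin(angles), np.cos(angles)], axis=1)
    rows = np.repeat(enc_1d[:, None, :], grid_size, axis=1)  # varies with row index
    cols = np.repeat(enc_1d[None, :, :], grid_size, axis=0)  # varies with column index
    return np.concatenate([rows, cols], axis=-1)             # (grid, grid, dim)


def superpositional(fine: np.ndarray, factor: int) -> np.ndarray:
    """Average each factor x factor block of fine encodings into one coarse encoding."""
    g, _, d = fine.shape
    assert g % factor == 0, "fine grid must be divisible by factor"
    c = g // factor
    blocks = fine.reshape(c, factor, c, factor, d)
    return blocks.mean(axis=(1, 3))                          # (c, c, dim)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T


# 16x16 fine grid and the 8x8 coarse grid covering the same ground area.
fine = sincos_2d(16, 64)
coarse = superpositional(fine, 2)

# One row per coarse patch, one column per fine patch (the quantity Figure 4 visualizes).
sims = cosine_similarity(coarse.reshape(-1, 64), fine.reshape(-1, 64))
print(coarse.shape, sims.shape)
```

Reshaping row i of sims back to a 16x16 grid gives a similarity map for coarse patch i, comparable in spirit to the visualization described in Figure 4.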