SenPa-MAE: Sensor Parameter Aware Masked Autoencoder for Multi-Satellite Self-Supervised Pretraining

Jonathan Prexl, Michael Schmitt

TL;DR

SenPa-MAE addresses the challenge of building a sensor-agnostic Earth observation foundation model by integrating sensor parameters directly into the embedding process. It extends masked autoencoding to multispectral imagery with a dedicated sensor-parameter encoding module for spectral response functions $\{\boldsymbol{\lambda}_c\}$ and ground sampling distances $\{\sigma_c\}$, and introduces Spectral Superposition Augmentation to diversify training data. The approach enables cross-sensor pretraining across Landsat, Sentinel-2, and Planet-SuperDove imagery, yielding more robust zero-shot and fine-tuned performance on multi-sensor land-cover tasks. This sensor-aware pretraining framework advances toward sensor-independent inference and cross-sensor fusion, with potential applicability to broader EO tasks and hyperspectral data.

Abstract

This paper introduces SenPa-MAE, a transformer architecture that encodes the sensor parameters of an observed multispectral signal into the image embeddings. SenPa-MAE can be pre-trained on imagery of different satellites with non-matching spectral or geometrical sensor characteristics. To incorporate sensor parameters, we propose a versatile sensor parameter encoding module as well as a data augmentation strategy for the diversification of the pre-training dataset. This enables the model to effectively differentiate between various sensors and gain an understanding of sensor parameters and the correlation to the observed signal. Given the rising number of Earth observation satellite missions and the diversity in their sensor specifications, our approach paves the way towards a sensor-independent Earth observation foundation model. This opens up possibilities such as cross-sensor training and sensor-independent inference.
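
The abstract names a data augmentation strategy but does not spell out how it works. The sketch below illustrates one physical fact such a strategy could build on, and it is an assumption that the paper's augmentation follows this form: because a channel is a response-weighted integral of the incoming spectrum, a weighted sum of channels behaves like a single synthetic channel whose response function is the same weighted sum. The helper names `observe` and `superpose`, the Gaussian response curves, the wavelength grid, and the unnormalized integral are all hypothetical choices made for illustration.

```python
import numpy as np

def observe(spectra, srfs, dw):
    """Integrate per-pixel spectra (h, w, m) against response functions (c, m).

    Uses a plain Riemann sum on a uniform wavelength grid with spacing dw
    (an assumption; no normalization by the response integral is applied).
    Returns channels of shape (c, h, w).
    """
    return np.einsum("hwm,cm->chw", spectra, srfs) * dw

def superpose(channels, srfs, weights):
    """Mix channels (c, h, w) and their response functions (c, m) with the
    same weights (c,), yielding one synthetic channel and its response."""
    new_channel = np.tensordot(weights, channels, axes=1)  # (h, w)
    new_srf = weights @ srfs                                # (m,)
    return new_channel, new_srf

rng = np.random.default_rng(0)
wavelengths = np.linspace(400.0, 1000.0, 61)   # nm, hypothetical grid
dw = wavelengths[1] - wavelengths[0]
spectra = rng.uniform(size=(4, 4, wavelengths.size))

# Hypothetical Gaussian response curves with values in (0, 1].
centers = np.array([490.0, 560.0, 665.0, 842.0])
srfs = np.exp(-0.5 * ((wavelengths[None, :] - centers[:, None]) / 20.0) ** 2)

channels = observe(spectra, srfs, dw)
weights = np.array([0.5, 0.5, 0.0, 0.0])       # convex mix of two channels
new_channel, new_srf = superpose(channels, srfs, weights)

# Observing the scene directly with the mixed response gives the same channel.
direct = observe(spectra, new_srf[None, :], dw)[0]
print(np.allclose(new_channel, direct))
```

With convex weights the mixed response function stays within $[0,1]$, so the synthetic channel can be paired with a response function of the same kind as those of the real sensors.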

Paper Structure (8 sections, 4 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: The spectral response functions $\boldsymbol{\lambda}_c$ of the sensors Landsat, Sentinel-2, and Planet SuperDove as a function of wavelength. Each $\boldsymbol{\lambda}_c$ takes values in $[0,1]$; the shifts along the $y$ axis serve purely to distinguish different sensors and different sets of $\sigma_c$ within the set of channels. The atmospheric transmittance spectrum (ATS) of the Earth, crucial when designing channels, is provided as a reference. Shaded blue boxes below the ATS indicate the position of the channels relative to the ATS.
  • Figure 2: The proposed SenPa-MAE architecture and the baseline model BaseMAE. After patch embedding, the tokens $\mathbf{t}_i$ undergo a three-step encoding procedure that adds information about patch position, spectral response function, and ground sampling distance. After the encoding, all tokens are processed in an MAE-like encoder-decoder setup. An analogous encoding procedure can be applied before the decoding step (compare \ref{sec:CPI}). BaseMAE differs from SenPa-MAE in that it uses only the positional encoding and therefore omits the injection of the sensor parameters.
  • Figure 3: The reconstruction result for a masked four-channel input. From left to right: Channel-specific mask, reconstructed signal for the masked areas, ground truth image with the GSD of the channel, as well as the spectral response function.
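
The three-step token encoding described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sinusoidal encoding, the linear projection `w_srf` of a sampled response function, and the names `sincos_encode` and `encode_tokens` are all hypothetical. The caption states only that information about position, spectral response function, and ground sampling distance is added to the tokens.

```python
import numpy as np

def sincos_encode(values, dim):
    """Sinusoidal encoding of scalars into vectors of even length `dim`
    (an assumed encoding; the source does not specify one)."""
    values = np.asarray(values, dtype=np.float64)[..., None]
    freqs = 1.0 / (10000.0 ** (np.arange(dim // 2) / (dim // 2)))
    angles = values * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def encode_tokens(tokens, positions, srf, gsd, w_srf):
    """Add three encodings to the tokens.

    tokens:    (n, d) patch embeddings t_i
    positions: (n,)   patch indices
    srf:       (n, k) sampled response function of each token's channel
    gsd:       (n,)   ground sampling distance of each token's channel
    w_srf:     (k, d) hypothetical linear projection of the response function
    """
    d = tokens.shape[1]
    out = tokens + sincos_encode(positions, d)   # step 1: patch position
    out = out + srf @ w_srf                      # step 2: spectral response
    out = out + sincos_encode(gsd, d)            # step 3: sampling distance
    return out

rng = np.random.default_rng(0)
n, d, k = 6, 8, 16
tokens = rng.normal(size=(n, d))
positions = np.arange(n)
srf = rng.uniform(0.0, 1.0, size=(n, k))
gsd = np.array([10.0, 10.0, 20.0, 20.0, 30.0, 30.0])   # metres, illustrative
w_srf = rng.normal(size=(k, d)) * 0.02

senpa_tokens = encode_tokens(tokens, positions, srf, gsd, w_srf)
base_tokens = tokens + sincos_encode(positions, d)      # BaseMAE: position only
print(senpa_tokens.shape, base_tokens.shape)
```

In this sketch the baseline differs from the sensor-aware variant only by skipping steps 2 and 3, which mirrors the distinction the caption draws between BaseMAE and SenPa-MAE.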