Lightweight, Pre-trained Transformers for Remote Sensing Timeseries

Gabriel Tseng, Ruben Cartuyvels, Ivan Zvonkov, Mirali Purohit, David Rolnick, Hannah Kerner

TL;DR

Presto addresses the challenge of limited labeled data in remote sensing by using a lightweight, self-supervised Transformer tailored to pixel-timeseries from multiple sensors. Through masked autoencoding and structured masking over 12-month pixel-timeseries with 15 dynamic channels and static metadata, Presto learns transferable representations that perform well across diverse tasks with far less compute than larger models. The approach yields strong results in timeseries, image, and image-timeseries settings, with ablations confirming the benefits of structured masking, pretraining, and scalable model size. This work demonstrates practical deployment potential for global-scale remote sensing pipelines, offering transfer learning and efficient feature extraction for practitioners with limited resources.
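
To make the idea of structured masking concrete, here is a minimal sketch in plain Python. It assumes the 12-month, 15-channel layout stated above. This summary does not list the masking strategies, so the two shown here (hiding whole channels, hiding a contiguous run of months) are illustrative assumptions, as are the function names, the three-channel group, the four-month span, and the zero fill value. It is not the authors' implementation.

```python
import random

NUM_MONTHS = 12    # from the text: 12-month pixel-timeseries
NUM_CHANNELS = 15  # from the text: 15 dynamic channels


def mask_channels(channels):
    """Mask the given channel indices at every timestep (True = hidden)."""
    hidden = set(channels)
    return [[c in hidden for c in range(NUM_CHANNELS)] for _ in range(NUM_MONTHS)]


def mask_contiguous_months(start, length):
    """Mask every channel for `length` consecutive months starting at `start`."""
    hidden = set(range(start, start + length))
    return [[t in hidden] * NUM_CHANNELS for t in range(NUM_MONTHS)]


def apply_mask(series, mask, fill=0.0):
    """Build the encoder input by replacing masked entries with `fill`.

    The zero fill is an assumption made for illustration only.
    """
    return [
        [fill if hidden else value for value, hidden in zip(row, mask_row)]
        for row, mask_row in zip(series, mask)
    ]


def sample_mask(rng):
    """Pick one of the two illustrative strategies at random."""
    if rng.random() < 0.5:
        # Hypothetical channel group: three adjacent channel indices.
        first = rng.randrange(NUM_CHANNELS - 2)
        return mask_channels(range(first, first + 3))
    length = 4  # assumed span length
    start = rng.randrange(NUM_MONTHS - length + 1)
    return mask_contiguous_months(start, length)
```

In a masked-autoencoding setup, the model would receive `apply_mask(series, mask)` and be trained to reconstruct the original `series` at the positions where `mask` is `True`.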

Abstract

Machine learning methods for satellite data have a range of societally relevant applications, but labels used to train models can be difficult or impossible to acquire. Self-supervision is a natural solution in settings with limited labeled data, but current self-supervised models for satellite data fail to take advantage of the characteristics of that data, including the temporal dimension (which is critical for many applications, such as monitoring crop growth) and availability of data from many complementary sensors (which can significantly improve a model's predictive performance). We present Presto (the Pretrained Remote Sensing Transformer), a model pre-trained on remote sensing pixel-timeseries data. By designing Presto specifically for remote sensing data, we can create a significantly smaller but performant model. Presto excels at a wide variety of globally distributed remote sensing tasks and performs competitively with much larger models while requiring far less compute. Presto can be used for transfer learning or as a feature extractor for simple models, enabling efficient deployment at scale.

Paper Structure

This paper contains 30 sections, 1 equation, 8 figures, 16 tables.

Figures (8)

  • Figure 1: Presto learns from structurally-masked remote sensing pixel-timeseries. We construct a multi-sensor remote sensing pixel-timeseries, and randomly select one of the four masking strategies described in Section \ref{sec:masking}. The encoder-decoder model is trained to reconstruct the original timeseries. At fine-tuning time, we discard the decoder and only use the encoder's output. The downstream task may have incomplete inputs (missing timesteps or sensors) since the encoder is specifically trained on such inputs. Presto receives both static-in-time and dynamic-in-time inputs and the location metadata of each pixel timeseries.
  • Figure 2: Presto learns to reconstruct channels that are completely masked in a spatially cohesive manner. In this experiment, we masked only the Sentinel-2 RGB channels; Presto was able to reconstruct these channels even when they were absent from the input. The reconstructions are spatially consistent even though Presto only receives single pixel inputs.
  • Figure 3: Presto is robust to incomplete inputs. We measured the AUC ROC score of Presto with linear probing (Presto$_{R}$) on the CropHarvest dataset when no Dynamic World input is passed, and with a subset of input months (the x-axis). We plot the performance of MOSAIKS-1D and TIML when they receive the full 12 months of input (dashed horizontal lines); Presto$_{R}$ recovered the performance of these models given only a subset of input months.
  • Figure 4: We obtained per-image predictions using Presto by computing a mean and standard deviation of Presto's per-pixel outputs, and passing this concatenated vector to a downstream classifier. We illustrate this for the EuroSat task.
  • Figure 5: EuroSat accuracy of a kNN@5 classifier given pre-trained model embeddings at a variety of input resolutions (following reed2022scale) as a function of FLOPs required to encode an image (note the log scale on the x-axis). All image-based models resized images to $224\times 224$, so the FLOPs required to encode an image do not change with image resolution. Presto achieved competitive results with image-based models while requiring up to four orders of magnitude fewer FLOPs to encode an image.
  • ...and 3 more figures
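
The per-image aggregation described in the Figure 4 caption (mean and standard deviation of per-pixel outputs, concatenated into one vector) can be sketched as follows. This is a minimal illustration using only the Python standard library. The function name `pool_pixel_embeddings` is hypothetical, and the use of the population standard deviation is an assumption, since the caption does not say which estimator is used.

```python
from statistics import fmean, pstdev


def pool_pixel_embeddings(pixel_embeddings):
    """Summarise one image as [per-dimension means] + [per-dimension stds].

    `pixel_embeddings` holds one equal-length embedding vector per pixel.
    The result has twice the embedding dimension.
    """
    # Transpose so each tuple collects one embedding dimension across pixels.
    dims = list(zip(*pixel_embeddings))
    means = [fmean(d) for d in dims]
    # Population standard deviation (an assumption, see the note above).
    stds = [pstdev(d) for d in dims]
    return means + stds


# Usage: two pixels with 2-dimensional embeddings give a 4-dimensional
# image-level feature vector for a downstream classifier.
image_features = pool_pixel_embeddings([[1.0, 2.0], [3.0, 2.0]])
```

The pooled vector has a fixed length regardless of how many pixels the image contains, which is what allows a per-pixel encoder to feed a simple image-level classifier.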