Paving the way toward foundation models for irregular and unaligned Satellite Image Time Series

Iris Dumeur, Silvia Valero, Jordi Inglada

TL;DR

An ALIgned Sits Encoder (ALISE) is proposed, a novel approach that leverages the spatial, spectral, and temporal dimensions of irregular and unaligned satellite image time series (SITS) while producing aligned latent representations.

Abstract

Although several foundation models for satellite remote sensing imagery have recently been proposed, they fail to address major challenges of real/operational applications. Indeed, embeddings that do not take into account the spectral, spatial and temporal dimensions of the data, as well as the irregular or unaligned temporal sampling, are of little use for most real-world applications. Consequently, we propose an ALIgned Sits Encoder (ALISE), a novel approach that leverages the spatial, spectral, and temporal dimensions of irregular and unaligned satellite image time series (SITS) while producing aligned latent representations. Unlike the self-supervised learning (SSL) models currently available for SITS, ALISE incorporates a flexible query mechanism to project the SITS into a common and learned temporal projection space. Additionally, thanks to a multi-view framework, we explore the integration of instance discrimination alongside a masked autoencoding task for SITS. The quality of the produced representation is assessed through three downstream tasks: crop segmentation (PASTIS), land cover segmentation (MultiSenGE), and a novel crop change detection dataset. Furthermore, the change detection task is performed without supervision. The results suggest that the use of aligned representations is more effective than previous SSL methods for linear probing segmentation tasks.

Paper Structure (43 sections, 13 equations, 15 figures, 7 tables)

Figures (15)

  • Figure 1: Overall description of the ALISE architecture. The input time series $X$ is first processed by the spectral spatial temporal encoder (SSTE) [dumeur-2024-self-super]. The obtained intermediate representations are then processed by a temporal projector. The temporal projector corresponds to a cross-attention mechanism with learnable queries $Q_{\alpha}$. For visual clarity, the cross-attention is represented for one attention head.
  • Figure 2: Description of the proposed multi-view SSL strategy. Given an input time series $X$ two views are generated: $X^A$ and $X^B$. Each view is processed independently by ALISE which generates their respective aligned latent representations $Y^A$ and $Y^B$. A decoder $g_{\phi}$ is trained to reconstruct one view using the latent representation of the other. Additional discriminative losses are computed on the latent representation.
  • Figure 3: Description of the projector architecture.
  • Figure 4: Description of the lightweight decoder employed for the cross-reconstruction task.
  • Figure 5: Geographical distribution of the tiles composing the datasets. The unlabeled pre-training dataset is composed of multi-year SITS selected within the blue and red boxes for the training and validation sets, respectively. The MultiSenGE labeled data are selected in the areas delineated by the black boxes. The PASTIS and CropRot datasets are within the green boxes.
  • ...and 10 more figures
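
The temporal projector described in the Figure 1 caption (cross-attention with learnable queries $Q_{\alpha}$) can be illustrated with a minimal sketch. This is not the authors' implementation: the sizes `D` and `N_Q`, the helper names, and the random stand-in for the learned queries are all assumptions made for illustration, and the per-date features are used directly as keys and values, whereas a real attention layer would typically apply learned linear projections first. The sketch only shows why a fixed set of queries yields outputs of identical shape for series with different numbers of acquisition dates.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: n_q vectors of size d (hypothetical stand-in for Q_alpha)
    keys, values: T vectors of size d, one per acquisition date
    Returns n_q vectors of size d, whatever the value of T.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # One score per date: similarity between the query and that date's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the values over the temporal axis.
        out.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)])
    return out


def random_series(rng, length, dim):
    """Random features standing in for encoder outputs (illustration only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]


rng = random.Random(0)
D, N_Q = 8, 4  # illustrative sizes, not taken from the paper
q_alpha = random_series(rng, N_Q, D)  # would be learned during training

series_a = random_series(rng, 11, D)  # a series with 11 acquisition dates
series_b = random_series(rng, 23, D)  # a series with 23 acquisition dates

y_a = cross_attention(q_alpha, series_a, series_a)
y_b = cross_attention(q_alpha, series_b, series_b)
print(len(y_a), len(y_a[0]), len(y_b), len(y_b[0]))  # prints: 4 8 4 8
```

Because the number of output vectors is set by the number of queries rather than by the number of input dates, both series end up with representations of the same size, which is the sense in which the latent representations are aligned.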