OmniSat: Self-Supervised Modality Fusion for Earth Observation

Guillaume Astruc, Nicolas Gonthier, Clement Mallet, Loic Landrieu

TL;DR

OmniSat addresses the need for self-supervised fusion of heterogeneous Earth Observation (EO) data by leveraging precise georeferenced alignment and a patch-level training paradigm. It combines modality-specific encoders with a multimodal fusion module and two objectives, a cross-modal contrastive loss and a multimodal reconstruction task, to learn rich representations without labels. The authors augment two EO benchmarks with new modalities and demonstrate state-of-the-art performance across forestry, land cover, and crop mapping tasks in both semi- and fully supervised settings, with notable gains even when only a single modality is available at inference. Ablations and efficiency analyses show the value of motion-aware reconstruction, modality-specific decoders, and careful architectural choices for EO data. The datasets and code are released to foster further multimodal EO research.
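To make the cross-modal contrastive objective more concrete, here is a minimal sketch of a generic InfoNCE-style loss between two modalities. It is an illustration under stated assumptions, not the authors' exact formulation: the function name `cross_modal_info_nce`, the temperature value, and the toy embeddings are all hypothetical. The only idea taken from the summary is that tokens of the same patch seen by different sensors should agree, while tokens of different patches should not.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def cross_modal_info_nce(tokens_a, tokens_b, temperature=0.1):
    """Generic InfoNCE-style loss (assumed form, not the paper's exact loss).

    tokens_a[p] and tokens_b[p] embed the same patch p as seen by two
    sensors and form the positive pair; every other patch of modality B
    serves as a negative for the anchor tokens_a[p].
    """
    a = [normalize(t) for t in tokens_a]
    b = [normalize(t) for t in tokens_b]
    loss = 0.0
    for p, anchor in enumerate(a):
        logits = [dot(anchor, other) / temperature for other in b]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
        # Cross-entropy with the aligned patch as the correct class.
        loss += log_norm - logits[p]
    return loss / len(a)


# Toy embeddings: modality B roughly agrees with modality A patch by patch.
aligned_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_b = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
# Breaking the spatial alignment should increase the loss.
shuffled_b = [aligned_b[1], aligned_b[2], aligned_b[0]]

loss_aligned = cross_modal_info_nce(aligned_a, aligned_b)
loss_shuffled = cross_modal_info_nce(aligned_a, shuffled_b)
```

In this toy setting the loss is small when the two modalities are spatially aligned and grows when the patch correspondence is shuffled, which is why precise georeferencing can serve as the supervisory signal.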

Abstract

The diversity and complementarity of sensors available for Earth Observation (EO) call for developing bespoke self-supervised multimodal learning approaches. However, current multimodal EO datasets and models typically focus on a single data type, either mono-date images or time series, which limits their impact. To address this issue, we introduce OmniSat, a novel architecture able to merge diverse EO modalities into expressive features without labels by exploiting their alignment. To demonstrate the advantages of our approach, we create two new multimodal datasets by augmenting existing ones with new modalities. As demonstrated for three downstream tasks -- forestry, land cover classification, and crop mapping -- OmniSat can learn rich representations without supervision, leading to state-of-the-art performance in semi- and fully supervised settings. Furthermore, our multimodal pretraining scheme improves performance even when only one modality is available for inference. The code and dataset are available at https://github.com/gastruc/OmniSat.

Paper Structure (22 sections, 4 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Datasets. We show three tiles from the considered multilabel classification datasets: FLAIR, TreeSatAI-TS, and PASTIS-HD. TreeSatAI-TS is a new dataset built by replacing the single-date Sentinel-1 and Sentinel-2 images of TreeSatAI \cite{ahlswede2022treesatai} with year-long time series. PASTIS-HD adds very high resolution (VHR) satellite images to PASTIS-R \cite{garnot2022multi}. $\star$: modalities added in this work.
  • Figure 2: OmniSat Architecture. We illustrate OmniSat for $M=3$ modalities, and a tile split into $P=4$ patches. The $M\times P$ input tokens $x^\mathbf{M}_\mathbf{P}$ are encoded by $M$ modality-specific encoders $\mathcal{E}^\mathbf{M}$, yielding the token representations $f^\mathbf{M}_\mathbf{P}$. The module $\mathcal{C}$ combines them into multimodal patch representations $f^\star_\mathbf{P}$. The token embeddings $f^\mathbf{M}_\mathbf{P}$ are supervised by a contrastive cross-modal objective. We also use a reconstruction objective: the masked multimodal representations $f^\star_\mathbf{P}$ are decoded by modality-specific networks $\mathcal{D}^\mathbf{M}$ to reconstruct their corresponding inputs in $x^\mathbf{M}_\mathbf{P}$.
  • Figure 3: OmniSat Architecture. OmniSat is composed of dedicated patch encoders for images (a) and time series (b), here represented for a length of $L=4$ time stamps. The modality combining module $\mathcal{C}$ is depicted in (c) with $P=2$ and $M=3$. Elements colored in orange are learned networks or parameters.
  • Figure 4:
  • Figure 5: Efficiency. We report the best performance of different models between TreeSatAI and TreeSatAI-TS, with pre-training and fine-tuning using $100\%$ of labels. The area of the markers is proportional to the training time, broken down into pre-training and fine-tuning when applicable.
  • ...and 2 more figures
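
The data flow described in the Figure 2 caption can be sketched as follows. This is a toy illustration, assuming random linear maps in place of the learned encoders $\mathcal{E}^\mathbf{M}$ and decoders $\mathcal{D}^\mathbf{M}$, and a plain average over visible tokens in place of the combining module $\mathcal{C}$. The helper names (`make_linear`, `combine`, `mse`), the raw token sizes, and the masking rule are hypothetical choices made for illustration and are not taken from the paper.

```python
import random

M, P, D = 3, 4, 2  # modalities, patches per tile, embedding size (toy values)
rng = random.Random(0)
input_dims = [5, 3, 7]  # assumed raw token size for each modality


def make_linear(in_dim, out_dim, rng):
    """Random linear map standing in for a learned network (hypothetical)."""
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    return lambda x: [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


# x[m][p]: raw input token of modality m for patch p (M x P tokens in total).
x = [[[rng.random() for _ in range(input_dims[m])] for _ in range(P)]
     for m in range(M)]

encoders = [make_linear(input_dims[m], D, rng) for m in range(M)]  # E^m
decoders = [make_linear(D, input_dims[m], rng) for m in range(M)]  # D^m

# f[m][p]: modality-specific token representations.
f = [[encoders[m](x[m][p]) for p in range(P)] for m in range(M)]

# Hide a random subset of tokens, keeping at least one visible per patch
# (an assumed masking rule, chosen so the average below is always defined).
masked = set()
for p in range(P):
    hidden = rng.sample(range(M), k=rng.randint(0, M - 1))
    masked.update((m, p) for m in hidden)


def combine(f, masked, p):
    """Stand-in for C: average the visible tokens of patch p."""
    visible = [f[m][p] for m in range(M) if (m, p) not in masked]
    return [sum(vals) / len(visible) for vals in zip(*visible)]


# f_star[p]: one multimodal representation per patch.
f_star = [combine(f, masked, p) for p in range(P)]

# Each modality-specific decoder reconstructs its own input from f_star[p],
# including tokens that were hidden from the combiner.
recon_loss = sum(
    mse(decoders[m](f_star[p]), x[m][p]) for m in range(M) for p in range(P)
) / (M * P)
```

Nothing is trained here; the sketch only tracks shapes. It shows that $M \times P$ input tokens become $P$ multimodal patch representations, and that each modality keeps its own decoder because raw token sizes differ across sensors.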