PyViT-FUSE: A Foundation Model for Multi-Sensor Earth Observation Data

Manuel Weber, Carly Beneke

TL;DR

PyViT-FUSE tackles the fusion of multi-sensor EO data across arbitrary bands and resolutions by learning a band-aware embedding via an attention-based fusion and a pyramidal Vision Transformer. It relies on a decoder-free SwAV self-supervised objective with band-drop augmentation, enabling cross-band generalization without pixel-space reconstruction on an Area of View (AOV) of size $H \times W$. Key contributions include a three-part architecture (Input Module, Fusion Module, Pyramidal ViT), interpretable attention maps that visualize band importance, and a demonstration on PV segmentation showing performance gains as additional modalities are incorporated. This approach enables flexible, scalable fusion of heterogeneous satellite data with practical benefits for downstream tasks under cloud cover and data sparsity.

Abstract

We propose PyViT-FUSE, a foundation model for Earth observation data explicitly designed to handle multi-modal imagery by learning to fuse an arbitrary number of mixed-resolution input bands into a single representation through an attention mechanism. The learned patch tokens are further processed by a stack of vision transformers with a novel pyramidal structure. We train the model on a globally sampled dataset in a self-supervised manner, leveraging core concepts of the SwAV algorithm. We demonstrate the interpretability of the fusion mechanism by visualizing the attention scores, and we show the model's applicability to downstream tasks.
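
To make the idea of fusing an arbitrary number of bands through attention more concrete, here is a minimal sketch. It is not the authors' implementation. It assumes that, for a given patch, every band has already been embedded into a vector of the same dimension (the handling of mixed resolutions is omitted), and it uses a single hypothetical query vector with scaled dot-product scores. The names `fuse_bands`, `band_embeddings`, and `query` are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_bands(band_embeddings, query):
    """Fuse any number of per-band patch embeddings into one vector.

    band_embeddings: list of B vectors, each of length D (assumed input).
    query: vector of length D (a hypothetical learned query).
    Returns (fused_vector, attention_weights).
    """
    dim = len(query)
    # One scaled dot-product score per band.
    scores = [
        sum(q * k for q, k in zip(query, emb)) / math.sqrt(dim)
        for emb in band_embeddings
    ]
    weights = softmax(scores)
    # Attention-weighted sum of the band embeddings.
    fused = [
        sum(w * emb[i] for w, emb in zip(weights, band_embeddings))
        for i in range(dim)
    ]
    return fused, weights


query = [1.0, 0.0, 0.0, 0.0]
three_bands = [
    [2.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]

# The output size does not depend on how many bands are available.
fused3, w3 = fuse_bands(three_bands, query)
fused2, w2 = fuse_bands(three_bands[:2], query)
```

Because the weights are normalized over whichever bands are present, the fused vector has the same length for any band combination. Under this sketch, taking the band with the largest weight per patch would give a band-importance map similar in spirit to the attention visualizations the abstract mentions.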

Paper Structure

This paper contains 12 sections, 1 equation, 13 figures.

Figures (13)

  • Figure 1: Model architecture of PyViT-FUSE consisting of three main components. An embedding representing the multi-modal input is generated independent of band combination.
  • Figure 2: Illustration of band drop data augmentation and self-supervised training with SwAV algorithm. Band drop is visualized by coding the sensor as color hue and its bands as different shades for 8 batches. Blacked out parts indicate dropped channels.
  • Figure 3: Sample input visualized as RGB image for each modality and corresponding averaged feature maps at the output of each ViT pyramid block. Colors indicate low (black) to high (red) activations.
  • Figure 4: Cosine similarity (left) and $L^{2}$ distance (right) between the embeddings of the global and a local view for a sample against all other samples in the batch.
  • Figure 5: Visualization of attention scores for each head of the fusion module. The color corresponds to the band with the highest score indicating the corresponding importance of the band.
  • ...and 8 more figures
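
The band drop augmentation named in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the drop probability, the rule of always keeping at least one band, and the band names are all hypothetical, and the function `band_drop` is not taken from any published code.

```python
import random


def band_drop(bands, drop_prob=0.5, rng=None):
    """Randomly drop whole bands from a sample, keeping at least one.

    bands: dict mapping a band name to its pixel data (assumed layout).
    drop_prob: probability of dropping each band (assumed value).
    rng: optional random.Random instance for reproducibility.
    """
    rng = rng or random.Random()
    names = list(bands)
    kept = [name for name in names if rng.random() >= drop_prob]
    if not kept:
        # Never return an empty sample: keep one band at random.
        kept = [rng.choice(names)]
    return {name: bands[name] for name in kept}


# Hypothetical sample with two sensors and two bands each.
sample = {
    "sensor_a_band_1": [[0.1, 0.2], [0.3, 0.4]],
    "sensor_a_band_2": [[0.5, 0.6], [0.7, 0.8]],
    "sensor_b_band_1": [[0.9, 1.0], [1.1, 1.2]],
    "sensor_b_band_2": [[1.3, 1.4], [1.5, 1.6]],
}

rng = random.Random(0)
# Two views of the same sample with independently dropped bands.
view_a = band_drop(sample, rng=rng)
view_b = band_drop(sample, rng=rng)
# Extreme case: every band is dropped, so exactly one is restored.
single = band_drop(sample, drop_prob=1.0, rng=rng)
```

In a SwAV-style setup, views such as `view_a` and `view_b` would be embedded by the same model and encouraged to agree, which is one plausible way a model could learn representations that do not depend on a specific band combination.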