PyViT-FUSE: A Foundation Model for Multi-Sensor Earth Observation Data
Manuel Weber, Carly Beneke
TL;DR
PyViT-FUSE fuses multi-sensor EO data across arbitrary bands and resolutions by learning a band-aware embedding through attention-based fusion followed by a pyramidal Vision Transformer. The model operates on an area of view (AOV) of size $H \times W$ and is trained with a decoder-free SwAV self-supervised objective combined with band-drop augmentation, which enables cross-band generalization without pixel-space reconstruction. Key contributions include a three-part architecture (Input Module, Fusion Module, Pyramidal ViT), interpretable attention maps that visualize band importance, and a demonstration on PV segmentation in which performance improves as additional modalities are incorporated. The approach enables flexible, scalable fusion of heterogeneous satellite data, with practical benefits for downstream tasks affected by cloud cover and data sparsity.
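To make the decoder-free objective more concrete, the following is a minimal NumPy sketch of a SwAV-style swapped-prediction loss applied to two views of the same scenes, where the views would differ in which bands were dropped. It is an illustration under stated assumptions, not the authors' implementation: the function names (`random_band_drop`, `sinkhorn`, `swapped_prediction_loss`), the drop probability, the temperature, and the number of Sinkhorn iterations are all hypothetical choices, and the encoder that maps each band subset to an embedding is not modeled here.

```python
import numpy as np


def random_band_drop(n_bands, rng, drop_prob=0.3):
    """Hypothetical band-drop augmentation: return a boolean keep-mask.

    At least one band is always kept so that a view is never empty.
    """
    keep = rng.random(n_bands) >= drop_prob
    if not keep.any():
        keep[rng.integers(n_bands)] = True
    return keep


def sinkhorn(scores, eps=0.05, n_iters=3):
    """Sinkhorn-Knopp normalization as used in SwAV.

    scores: (N, K) similarities between N embeddings and K prototypes.
    Returns soft codes of shape (N, K); each row sums to 1 and the
    prototypes are used approximately equally across the batch.
    """
    # Subtracting the global max only rescales q, which is normalized below.
    q = np.exp((scores - scores.max()) / eps).T  # (K, N)
    q /= q.sum()
    k, n = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True)  # normalize each prototype row
        q /= k
        q /= q.sum(axis=0, keepdims=True)  # normalize each sample column
        q /= n
    return (q * n).T


def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def swapped_prediction_loss(z1, z2, prototypes, temperature=0.1):
    """Predict the code of one view from the embedding of the other.

    z1, z2: (N, D) embeddings of the same N scenes under two views.
    prototypes: (K, D) prototype vectors (learned in a real setup).
    No decoder is involved: the loss lives entirely in embedding space.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    c = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    s1, s2 = z1 @ c.T, z2 @ c.T          # (N, K) cosine similarities
    q1, q2 = sinkhorn(s1), sinkhorn(s2)  # target codes (no gradient in practice)
    p1 = log_softmax(s1 / temperature)
    p2 = log_softmax(s2 / temperature)
    # Cross-entropy with swapped targets, averaged over both directions.
    return -0.5 * (np.mean(np.sum(q2 * p1, axis=1))
                   + np.mean(np.sum(q1 * p2, axis=1)))
```

In a training loop, `random_band_drop` would be called twice per sample to produce two band subsets, each subset would be encoded to an embedding, and the two embeddings would be passed to `swapped_prediction_loss`. Because the targets are cluster codes rather than pixels, no reconstruction decoder is needed.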
Abstract
We propose PyViT-FUSE, a foundation model for Earth observation data explicitly designed to handle multi-modal imagery by learning to fuse an arbitrary number of mixed-resolution input bands into a single representation through an attention mechanism. The learned patch tokens are further processed by a stack of vision transformers with a novel pyramidal structure. We train the model on a globally sampled dataset in a self-supervised manner, leveraging core concepts of the SwAV algorithm. We demonstrate the interpretability of the fusion mechanism by visualizing the attention scores, and we show the model's applicability to downstream tasks.
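As an illustration of how an attention mechanism can fuse a variable number of bands into a single token per patch, the sketch below uses a single learned query that attends over per-band tokens. This is a simplified, hypothetical formulation and not necessarily the one used in PyViT-FUSE: the function name `fuse_bands`, the single-query design, and the projection matrices `w_key` and `w_value` are assumptions, and the sketch presumes that all bands have already been resampled and embedded to tokens of a common dimension.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def fuse_bands(band_tokens, band_mask, query, w_key, w_value):
    """Fuse per-band patch tokens into one token per patch.

    band_tokens: (P, B, D) one D-dim token per patch (P) and band (B).
                 Tokens of missing bands must be finite (e.g. zeros).
    band_mask:   (B,) boolean, True where the band is available.
    query:       (D,) learned query vector (an assumption of this sketch).
    w_key, w_value: (D, D) projection matrices.
    Returns (fused, weights) with shapes (P, D) and (P, B).
    """
    if not band_mask.any():
        raise ValueError("at least one band must be available")
    keys = band_tokens @ w_key      # (P, B, D)
    values = band_tokens @ w_value  # (P, B, D)
    scores = keys @ query / np.sqrt(query.shape[0])  # (P, B)
    # Missing bands get -inf so that they receive exactly zero attention.
    scores = np.where(band_mask[None, :], scores, -np.inf)
    weights = softmax(scores, axis=-1)  # (P, B), rows sum to 1
    fused = np.einsum("pb,pbd->pd", weights, values)
    return fused, weights
```

The returned `weights` array holds one attention score per patch and band. Reshaping a band's column back to the patch grid gives a spatial map of how much that band contributed to the fused representation, which is the kind of visualization the abstract refers to. Because the softmax runs only over available bands, the same parameters work for any subset of inputs.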
