Bridging Remote Sensors with Multisensor Geospatial Foundation Models

Boran Han, Shuai Zhang, Xingjian Shi, Markus Reichstein

TL;DR

The paper tackles the mismatch between geospatial sensor heterogeneity and existing pretrained representations by presenting msGFM, a multisensor geospatial foundation model that unifies optical and microwave data through cross-sensor masked image modeling (MIM). It uses per-sensor patch embeddings and a shared Swin-transformer encoder, with cross-sensor reconstruction and sensor-specific decoders. The model is trained on GeoPile-2 (~2M images across RGB, Sentinel-2, SEN12MS, and DSM) and optimized with a combined MIM and auxiliary loss, using Mixture-of-Experts to handle modality differences. Empirical results across scene classification, cloud removal, pan-sharpening, and segmentation show that msGFM outperforms single-sensor pretraining and benefits notably from cross-sensor reconstruction and from training from scratch, while distillation from natural-image models is less effective due to domain gaps. The work demonstrates the viability and advantages of unified multisensor pretraining for robust geospatial understanding and provides practical guidance for building future multisensor foundation models. Extending the approach to temporal dynamics is a possible next step, subject to computational cost.

Abstract

In the realm of geospatial analysis, the diversity of remote sensors, encompassing both optical and microwave technologies, offers a wealth of distinct observational capabilities. Recognizing this, we present msGFM, a multisensor geospatial foundation model that effectively unifies data from four key sensor modalities. This integration spans an expansive dataset of two million multisensor images. msGFM is uniquely adept at handling both paired and unpaired sensor data. For data originating from identical geolocations, our model employs an innovative cross-sensor pretraining approach in masked image modeling, enabling the synthesis of joint representations from diverse sensors. msGFM, incorporating four remote sensors, upholds strong performance, forming a comprehensive model adaptable to various sensor types. msGFM has demonstrated enhanced proficiency in a range of both single-sensor and multisensor downstream tasks. These include scene classification, segmentation, cloud removal, and pan-sharpening. A key discovery of our research is that representations derived from natural images are not always compatible with the distinct characteristics of geospatial remote sensors, underscoring the limitations of existing representations in this field. Our work can serve as a guide for developing multisensor geospatial pretraining models, paving the way for more advanced geospatial capabilities.
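
The cross-sensor pretraining idea can be illustrated with a small, framework-free sketch. This is not the authors' implementation: the names `PAIRED_WITH`, `sample_mask`, and `pick_target`, the 0.5 cross-sensor probability, and the 0.6 mask ratio are all assumptions chosen for illustration. Only the sensor pairings come from the paper (Figure 1). The sketch covers the routing logic alone: each sample is masked, and when a co-located partner sensor is available, the reconstruction target is sometimes switched to that partner.

```python
import random

# Pairings taken from the Figure 1 caption; everything else here is hypothetical.
PAIRED_WITH = {"SAR": "Sentinel-2", "Sentinel-2": "SAR", "RGB": "DSM", "DSM": "RGB"}


def sample_mask(num_patches, mask_ratio, rng):
    """Return the sorted indices of patches hidden from the encoder."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def pick_target(sensor, has_pair, p_cross, rng):
    """Choose which sensor's decoder reconstructs the masked input.

    Unpaired samples always reconstruct themselves; paired samples switch
    to the co-located partner with probability p_cross (an assumed value).
    """
    if has_pair and rng.random() < p_cross:
        return PAIRED_WITH[sensor]
    return sensor


rng = random.Random(0)
batch = [("SAR", True), ("RGB", True), ("Sentinel-2", False)]
for sensor, has_pair in batch:
    masked = sample_mask(num_patches=16, mask_ratio=0.6, rng=rng)
    target = pick_target(sensor, has_pair, p_cross=0.5, rng=rng)
    print(f"{sensor}: {len(masked)} masked patches, reconstruct {target}")
```

In a full model, the chosen target would select which sensor-specific decoder receives the shared encoder's output; that part is omitted here because the excerpt does not specify it.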

Paper Structure (27 sections, 2 equations, 7 figures, 12 tables)

Figures (7)

  • Figure 1: Examples of four sensor modalities: SAR, Sentinel-2, RGB, and DSM. The pairs {SAR, Sentinel-2} and {RGB, DSM} are each co-located at the same geolocation. For Sentinel-2, only the blue, green, and red bands are shown for ease of visualization.
  • Figure 2: Overview diagram of msGFM. Each sensor is fed through a separate patch embedding layer (Section \ref{sec:input}) and then through the same shared encoder. Separate decoders are used for reconstruction. If the sensors are paired, there is a chance that the model will reconstruct the corresponding paired sensor instead of the input sensor (Section \ref{sec:cross}). Other best practices can be found in Section \ref{sec:moe}. In the finetuning stage, the pretrained encoder (msGFM) is transferred to different downstream applications with different prediction heads. Appendix \ref{experimental_settings} discusses the use of patch embeddings in downstream finetuning.
  • Figure 3: Examples of cross-sensor pretraining. The first row represents the input before masking, the second row depicts the reconstruction, and the third row shows the ground truth.
  • Figure 4: Comparison of our multisensor approach with single-modality pretraining on 10% of the BEN dataset (left) using mAP ($\uparrow$) and 1% of SEN12MS-CR (right) using SAM ($\downarrow$). Given the reduced dataset size for cloud removal (1% of SEN12MS-CR), we run the experiment in three replicates and report both the mean (top line in each cell) and the standard deviation (bottom line in each cell).
  • Figure 5: SAR backscatter statistics comparing input and reconstruction using MIM. The two bands of SAR are HV and VV. The mean and standard deviation for the HV band are shown on the left, while those for the VV band are displayed on the right. The Speckle Suppression Index (SSI) values are presented in the right panel. An SSI value closer to one indicates that the mean and standard deviation remain consistent before and after reconstruction.
  • ...and 2 more figures
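
Two of the metrics named in the captions above, SAM (Figure 4) and SSI (Figure 5), can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's evaluation code: `spectral_angle` computes the standard spectral angle between two band vectors for a single pixel (how the paper aggregates over pixels, and whether it reports radians or degrees, is not stated in this excerpt), and `speckle_suppression_index` uses a commonly cited definition of SSI as a ratio of coefficients of variation, which is assumed here rather than taken from the paper. Both function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def spectral_angle(pred, ref):
    """Spectral angle (radians) between two per-pixel band vectors; lower is better."""
    dot = sum(p * r for p, r in zip(pred, ref))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(r * r for r in ref))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def speckle_suppression_index(original, reconstructed):
    """Ratio of coefficients of variation (std / mean), reconstructed over original.

    Assumed definition; values near 1 mean the relative spread is unchanged.
    """
    cv_orig = pstdev(original) / mean(original)
    cv_rec = pstdev(reconstructed) / mean(reconstructed)
    return cv_rec / cv_orig


# Parallel band vectors differ only in brightness, so the angle is close to zero.
print(spectral_angle([0.2, 0.4, 0.6], [0.1, 0.2, 0.3]))
# Scaling every value by the same factor leaves the relative spread unchanged.
print(speckle_suppression_index([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Under this definition, a reconstruction that smooths the signal while keeping the same mean yields an SSI below one, which matches the caption's reading that values near one indicate preserved backscatter statistics.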