Bridging Remote Sensors with Multisensor Geospatial Foundation Models
Boran Han, Shuai Zhang, Xingjian Shi, Markus Reichstein
TL;DR
The paper addresses the mismatch between heterogeneous geospatial sensors and existing pretrained representations by presenting msGFM, a multisensor geospatial foundation model that unifies optical and microwave data through cross-sensor masked image modeling (MIM). The model combines per-sensor patch embeddings, a shared Swin Transformer encoder, and sensor-specific decoders with cross-sensor reconstruction. It is pretrained on GeoPile-2 (~2M images across RGB, Sentinel-2, SEN12MS, and DSM) with a combined MIM and auxiliary loss, and it uses Mixture-of-Experts to handle differences between modalities. Across scene classification, cloud removal, pan-sharpening, and segmentation, msGFM outperforms single-sensor pretraining and benefits notably from cross-sensor reconstruction and from training from scratch, whereas distillation from natural-image models is less effective because of the domain gap. The work demonstrates the viability and advantages of unified multisensor pretraining for robust geospatial understanding and offers practical guidance for building future multisensor foundation models. Extending the approach to temporal dynamics is a possible next step, subject to computational cost.
Abstract
In the realm of geospatial analysis, the diversity of remote sensors, encompassing both optical and microwave technologies, offers a wealth of distinct observational capabilities. Recognizing this, we present msGFM, a multisensor geospatial foundation model that effectively unifies data from four key sensor modalities. This integration spans an expansive dataset of two million multisensor images. msGFM is uniquely adept at handling both paired and unpaired sensor data. For data originating from identical geolocations, our model employs an innovative cross-sensor pretraining approach in masked image modeling, enabling the synthesis of joint representations from diverse sensors. msGFM, incorporating four remote sensors, upholds strong performance, forming a comprehensive model adaptable to various sensor types. msGFM has demonstrated enhanced proficiency in a range of both single-sensor and multisensor downstream tasks. These include scene classification, segmentation, cloud removal, and pan-sharpening. A key discovery of our research is that representations derived from natural images are not always compatible with the distinct characteristics of geospatial remote sensors, underscoring the limitations of existing representations in this field. Our work can serve as a guide for developing multisensor geospatial pretraining models, paving the way for more advanced geospatial capabilities.
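To make the cross-sensor masked image modeling idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes two illustrative sensors (`rgb` and `sar`), a single `tanh` linear layer standing in for the shared Swin Transformer encoder, linear sensor-specific decoders, and an L1 loss on masked patches. All names, sizes, and the choice of loss are hypothetical, and paired images are assumed to share the same spatial grid.

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH = 4                          # patch side length (hypothetical)
DIM = 16                           # shared embedding width (hypothetical)
CHANNELS = {"rgb": 3, "sar": 2}    # illustrative sensors and channel counts


def patchify(img, patch=PATCH):
    """Split a (C, H, W) image into (num_patches, C * patch * patch) rows."""
    c, h, w = img.shape
    gh, gw = h // patch, w // patch
    x = img.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 0, 2, 4)             # (gh, gw, C, patch, patch)
    return x.reshape(gh * gw, c * patch * patch)


# One patch-embedding matrix per sensor, mapping into a shared token space.
embed = {s: rng.normal(0, 0.02, (c * PATCH * PATCH, DIM))
         for s, c in CHANNELS.items()}
# Stand-in for the shared encoder (the paper uses a Swin Transformer).
W_enc = rng.normal(0, 0.02, (DIM, DIM))
# One decoder per sensor, mapping shared tokens back to that sensor's pixels.
decode = {s: rng.normal(0, 0.02, (DIM, c * PATCH * PATCH))
          for s, c in CHANNELS.items()}
mask_token = np.zeros(DIM)         # would be a learned vector in practice


def encode(img, sensor, mask):
    """Embed patches, replace masked ones with the mask token, then encode."""
    tokens = patchify(img) @ embed[sensor]
    tokens = np.where(mask[:, None], mask_token, tokens)
    return np.tanh(tokens @ W_enc)


def cross_sensor_mim_loss(imgs, mask_ratio=0.6):
    """L1 reconstruction loss on masked patches for every (source, target) pair.

    `imgs` maps sensor name -> (C, H, W) array. With paired sensors, each
    masked source is decoded into every sensor; with a single (unpaired)
    sensor, only the (sensor, sensor) term remains.
    """
    losses = {}
    for src, img in imgs.items():
        n = patchify(img).shape[0]
        mask = np.zeros(n, dtype=bool)
        n_masked = max(1, int(mask_ratio * n))
        mask[rng.choice(n, size=n_masked, replace=False)] = True
        latent = encode(img, src, mask)
        for tgt, tgt_img in imgs.items():
            pred = latent @ decode[tgt]
            target = patchify(tgt_img)
            losses[(src, tgt)] = float(np.abs(pred - target)[mask].mean())
    return losses
```

Calling `cross_sensor_mim_loss({"rgb": rgb_img, "sar": sar_img})` on a co-located pair returns four terms (two same-sensor and two cross-sensor), while passing a single sensor falls back to ordinary masked image modeling. This mirrors, in simplified form, how one model can consume both paired and unpaired data.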
