Towards Geospatial Foundation Models via Continual Pretraining
Matias Mendieta, Boran Han, Xingjian Shi, Yi Zhu, Chen Chen
TL;DR
This work addresses the high resource cost of building geospatial foundation models by introducing GeoPile, a compact, diverse pretraining dataset, and a novel multi-objective continual pretraining framework (GFM) that leverages a frozen ImageNet-22k teacher via feature distillation alongside self-supervised masked image modeling (MIM). The approach achieves state-of-the-art or competitive results on seven downstream geospatial datasets spanning change detection, classification, multi-label classification, semantic segmentation, and super-resolution, with substantially lower training time and CO2 impact than prior methods such as SatMAE. Key contributions include the data-centric GeoPile policy, a practical teacher-student MIM framework, and extensive ablations that demonstrate the importance of distillation, data composition, and objective design for efficient geospatial learning. Overall, GFM demonstrates a scalable, sustainable path to effective geospatial foundation models by reusing large-scale natural-image representations while learning valuable in-domain features.
Abstract
Geospatial technologies are becoming increasingly essential in our world for a wide range of applications, including agriculture, urban planning, and disaster response. To help improve the applicability and performance of deep learning models on these geospatial tasks, various works have begun investigating foundation models for this domain. Researchers have explored two prominent approaches for introducing such models in geospatial applications, but both have drawbacks in terms of limited performance benefit or prohibitive training cost. Therefore, in this work, we propose a novel paradigm for building highly effective geospatial foundation models with minimal resource cost and carbon impact. We first construct a compact yet diverse dataset from multiple sources to promote feature diversity, which we term GeoPile. Then, we investigate the potential of continual pretraining from large-scale ImageNet-22k models and propose a multi-objective continual pretraining paradigm, which leverages the strong representations of ImageNet while simultaneously providing the freedom to learn valuable in-domain features. Our approach outperforms previous state-of-the-art geospatial pretraining methods in an extensive evaluation on seven downstream datasets covering various tasks such as change detection, classification, multi-label classification, semantic segmentation, and super-resolution.
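To make the multi-objective idea concrete, the following is a minimal sketch of how a masked-reconstruction term and a teacher-distillation term might be combined into one training loss. It is an illustration under stated assumptions, not the authors' implementation: this section does not specify the loss functions, so the L1 reconstruction error, the cosine distance for distillation, the `distill_weight` parameter, and all function names are hypothetical choices. Features and pixels are represented as flat Python lists to keep the example dependency-free.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity (assumed distillation metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Small epsilon guards against division by zero for all-zero vectors.
    return 1.0 - dot / (norm_a * norm_b + 1e-12)


def masked_l1(pred: Sequence[float], target: Sequence[float],
              mask: Sequence[int]) -> float:
    """Mean absolute error over masked positions only (mask[i] == 1)."""
    n_masked = sum(mask)
    if n_masked == 0:
        return 0.0
    total = sum(abs(p - t) * m for p, t, m in zip(pred, target, mask))
    return total / n_masked


def continual_pretraining_loss(student_feat: Sequence[float],
                               teacher_feat: Sequence[float],
                               reconstruction: Sequence[float],
                               image: Sequence[float],
                               mask: Sequence[int],
                               distill_weight: float = 1.0) -> float:
    """Sum of the in-domain MIM term and the teacher-distillation term.

    teacher_feat is assumed to come from a frozen teacher, so only the
    student would receive gradients in a real training loop.
    """
    mim_term = masked_l1(reconstruction, image, mask)
    distill_term = cosine_distance(student_feat, teacher_feat)
    return mim_term + distill_weight * distill_term


if __name__ == "__main__":
    loss = continual_pretraining_loss(
        student_feat=[0.5, 0.1, 0.3],
        teacher_feat=[0.4, 0.2, 0.3],
        reconstruction=[0.9, 0.2, 0.4, 0.7],
        image=[1.0, 0.0, 0.5, 0.7],
        mask=[1, 1, 0, 0],
    )
    print(f"combined loss: {loss:.4f}")
```

In this sketch the reconstruction term pushes the student toward in-domain features, while the distillation term keeps its representation close to the teacher's; `distill_weight` is a hypothetical knob for balancing the two.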
