Towards Geospatial Foundation Models via Continual Pretraining

Matias Mendieta, Boran Han, Xingjian Shi, Yi Zhu, Chen Chen

TL;DR

This work addresses the high resource cost of building geospatial foundation models by introducing GeoPile, a compact, diverse pretraining dataset, and a multi-objective continual pretraining framework (GFM) that combines feature distillation from a frozen ImageNet-22k teacher with self-supervised masked image modeling (MIM). The approach achieves state-of-the-art or competitive results on seven downstream geospatial datasets spanning change detection, classification, multi-label classification, semantic segmentation, and super-resolution, with substantially lower training time and CO2 impact than prior methods such as SatMAE. Key contributions include the data-centric GeoPile policy, a practical teacher-student MIM framework, and extensive ablations that demonstrate the importance of distillation, data composition, and objective design for efficient geospatial learning. Overall, GFM demonstrates a scalable, sustainable path to effective geospatial foundation models by reusing large-scale natural-image representations while learning valuable in-domain features.

Abstract

Geospatial technologies are becoming increasingly essential in our world for a wide range of applications, including agriculture, urban planning, and disaster response. To help improve the applicability and performance of deep learning models on these geospatial tasks, various works have begun investigating foundation models for this domain. Researchers have explored two prominent approaches for introducing such models in geospatial applications, but both have drawbacks in terms of limited performance benefit or prohibitive training cost. Therefore, in this work, we propose a novel paradigm for building highly effective geospatial foundation models with minimal resource cost and carbon impact. We first construct a compact yet diverse dataset from multiple sources to promote feature diversity, which we term GeoPile. Then, we investigate the potential of continual pretraining from large-scale ImageNet-22k models and propose a multi-objective continual pretraining paradigm, which leverages the strong representations of ImageNet while simultaneously providing the freedom to learn valuable in-domain features. Our approach outperforms previous state-of-the-art geospatial pretraining methods in an extensive evaluation on seven downstream datasets covering various tasks such as change detection, classification, multi-label classification, semantic segmentation, and super-resolution.

Paper Structure (21 sections, 4 equations, 5 figures, 12 tables)

Figures (5)

  • Figure 1: Our geospatial foundation model (GFM) achieves favorable performance on a broad set of tasks in comparison to other state-of-the-art geospatial pretraining methods (SeCo, SatMAE) and ImageNet supervised pretraining baselines. Legend: Cyan: ImageNet-1k Supervised (ResNet50), Blue: SeCo, Purple: ImageNet-22k Supervised (ViT), Orange: SatMAE, Gray: ImageNet-22k Supervised (Swin), Green: GFM (ours).
  • Figure 2: Example images from the pretraining datasets: Sentinel-2 (left) and GeoPile (right). Sentinel-2 has noticeably lower feature diversity, both within a single image and across images, than our GeoPile pretraining dataset.
  • Figure 3: Our GFM continual pretraining pipeline, which leverages publicly available large-scale models in concert with our compiled geospatial dataset and pretraining objective. First, we select a concise set of data from various sources, which we term GeoPile (Section \ref{sec:data}). Next, we train GFM with our multi-objective continual pretraining approach. Our GFM framework is constructed as a teacher-student paradigm with two parallel model branches. The teacher $\mathcal{F}^{T}$ (top) is initialized with ImageNet-22k weights and frozen during training. The student $\mathcal{F}^{S}$ (bottom) is randomly initialized and is trained to serve as the final geospatial foundation model. In a continual pretraining fashion, we leverage the intermediate features of an ImageNet-22k pretrained model to guide and accelerate learning. Furthermore, we build an MIM objective into the student branch to learn valuable in-domain features directly from the geospatial data.
  • Figure 4: Qualitative results of downstream performance on OSCD comparing our GFM with ImageNet-22k and randomly initialized baselines. White, green, and red indicate true positives, false positives, and false negatives, respectively.
  • Figure 5: (a) Distillation stage ablation results. (b) Student initialization ablation results. "Both" indicates that the teacher and student branches are both initialized with ImageNet weights prior to geospatial pretraining. "Teacher" indicates that only the teacher branch is initialized, as described in Section \ref{sec:gfm}.
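
The two-branch objective described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation. The text above only states that the student combines an MIM objective with guidance from the frozen teacher's intermediate features. The specific choices below are assumptions made for illustration: an L1 reconstruction loss over masked positions, a `1 - cosine similarity` distillation term, and the names `masked_l1`, `cosine_similarity`, `combined_loss`, and `distill_weight`. Features and pixels are flattened into plain lists so the example stays dependency-free.

```python
import math
from typing import Sequence


def masked_l1(pred: Sequence[float], target: Sequence[float],
              mask: Sequence[int]) -> float:
    """Mean absolute error over masked positions only (mask[i] == 1).

    Assumption: the MIM term is an L1 reconstruction loss on masked inputs.
    """
    errors = [abs(p - t) for p, t, m in zip(pred, target, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (epsilon avoids 0/0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def combined_loss(pred: Sequence[float], target: Sequence[float],
                  mask: Sequence[int], student_feat: Sequence[float],
                  teacher_feat: Sequence[float],
                  distill_weight: float = 1.0) -> float:
    """Hypothetical multi-objective loss: MIM term plus distillation term.

    teacher_feat is treated as a constant, mirroring a frozen teacher:
    in a real training loop no gradient would flow into the teacher.
    distill_weight is an illustrative knob, not a value from the paper.
    """
    mim_term = masked_l1(pred, target, mask)
    distill_term = 1.0 - cosine_similarity(student_feat, teacher_feat)
    return mim_term + distill_weight * distill_term


if __name__ == "__main__":
    loss = combined_loss(
        pred=[1.0, 2.0, 3.0], target=[1.0, 0.0, 5.0], mask=[0, 1, 1],
        student_feat=[1.0, 0.0], teacher_feat=[0.0, 1.0],
    )
    print(f"combined loss: {loss:.4f}")
```

In this sketch the distillation term pulls the student's features toward the teacher's ImageNet-22k representation, while the masked reconstruction term is computed only from the geospatial input, so the student remains free to learn in-domain features.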