Towards Knowledge Guided Pretraining Approaches for Multimodal Foundation Models: Applications in Remote Sensing

Praveen Ravirathinam; Ajitesh Parthasarathy; Ankush Khandelwal; Rahul Ghosh; Vipin Kumar

TL;DR

This work introduces KG-VSF, a knowledge-guided pretraining objective that models weather-driven changes in the land surface as a conditional generation task to learn causal multimodal embeddings for remote sensing. Using a two-phase pretraining scheme (masked reconstruction followed by forecasting) and an encoder–decoder architecture with forward-only attention, KG-VSF yields embeddings that outperform traditional masked-reconstruction and other VSF baselines across crop mapping, soil moisture estimation and forecasting, and spectral-imagery tasks. The results show improved downstream performance and robust embeddings, and they offer evidence that the approach captures causal relationships rather than mere correlations, which has implications for scalable, physically consistent geospatial foundation models. The methodology and findings point toward more interpretable and transferable multimodal models in geoscience and potentially in other domains where causal drivers shape observable responses.
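As a rough illustration of what "forward-only attention" could mean, the sketch below builds a causal mask over timesteps, so that each position attends only to itself and to earlier positions. Reading forward-only attention as standard causal masking is an assumption made here; this summary does not specify the exact mechanism, and the function name `forward_only_mask` is hypothetical.

```python
from typing import List


def forward_only_mask(num_steps: int) -> List[List[bool]]:
    """Return a causal mask (assumed reading of 'forward-only attention').

    mask[i][j] is True when query timestep i may attend to key timestep j,
    i.e. when j is not later than i.
    """
    return [[j <= i for j in range(num_steps)] for i in range(num_steps)]


if __name__ == "__main__":
    # Print the mask for a 4-step sequence: 1 = visible, 0 = blocked.
    for row in forward_only_mask(4):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

Under this reading, information flows only from past to future timesteps, which matches the direction in which drivers such as weather act on the land surface.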

Abstract

Self-supervised learning has emerged as a powerful paradigm for pretraining foundation models using large-scale data. Existing pretraining approaches predominantly rely on masked reconstruction or next-token prediction strategies, demonstrating strong performance across various downstream tasks, including geoscience applications. However, these approaches do not fully capture the knowledge of causal interplay between different geospatial and environmental variables. To address this limitation, we propose Knowledge Guided Variable-Step Forecasting (KG-VSF), a novel pretraining task that models forecasting as a conditional generation task, where driver variables (e.g., weather) inform the prediction of response variables (e.g., satellite imagery). We demonstrate that pretraining in this fashion yields strong embeddings that, when finetuned, outperform embeddings from other standard pretraining approaches on downstream tasks where capturing this causality matters, such as pixel-wise crop type mapping, soil moisture estimation and forecasting, missing image prediction, and future image forecasting.
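To make the conditional-generation framing concrete, the sketch below assembles one pretraining sample from an aligned time series: a context window of imagery and weather, the weather up to a randomly chosen future date, and the image at that date as the target. It is a minimal sketch under stated assumptions, not the authors' data pipeline. The function name `make_kg_vsf_sample`, its parameters, and the uniform sampling of the forecast step are all hypothetical.

```python
import random
from typing import Dict, List, Sequence


def make_kg_vsf_sample(
    images: Sequence,
    weather: Sequence,
    context_len: int,
    max_step: int,
    rng: random.Random,
) -> Dict[str, object]:
    """Build one hypothetical KG-VSF-style sample from an aligned time series.

    images[t] and weather[t] are assumed to describe the same timestep t.
    """
    if len(images) != len(weather):
        raise ValueError("images and weather must be aligned in time")
    if context_len < 1 or max_step < 1:
        raise ValueError("context_len and max_step must be at least 1")
    if len(images) < context_len + 1:
        raise ValueError("series too short for the requested context")

    # Variable step: choose how far ahead to forecast (assumed uniform).
    step = rng.randint(1, min(max_step, len(images) - context_len))
    # Choose a context window that leaves room for the chosen step.
    start = rng.randint(0, len(images) - context_len - step)
    end = start + context_len      # first index after the context window
    target_t = end + step - 1      # timestep of the image to predict

    return {
        "context_images": list(images[start:end]),
        "context_weather": list(weather[start:end]),
        # Driver variables from the end of the context up to the target date.
        "future_weather": list(weather[end:target_t + 1]),
        "target_image": images[target_t],
        "step": step,
    }


if __name__ == "__main__":
    imgs = [f"img{t}" for t in range(10)]
    wx = [f"wx{t}" for t in range(10)]
    sample = make_kg_vsf_sample(imgs, wx, context_len=4, max_step=3,
                                rng=random.Random(0))
    print(sample["step"], sample["target_image"], sample["future_weather"])
```

The point of the design is that the model sees the drivers (weather) all the way to the target date but never the response (imagery) beyond the context window, so the target image must be generated conditionally on the drivers.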

Paper Structure (40 sections, 13 figures, 14 tables)

Introduction
Related Work
Architecture
Pretraining
Dataset Description
Data Sources
Datasets
Masking
Experimental Evaluation
Comparative Pretraining Frameworks
Implementation and Pretraining Details
Results: Pixel Wise Crop Mapping
Results: Soil Moisture Tasks
Soil Moisture Estimation
Soil Moisture Forecasting
...and 25 more sections

Figures (13)

Figure 1: Comparison of different pretraining tasks: Masked Autoencoding (MR) and Variable Step Forecasting (VSF) in both single- and multi-modality settings.
Figure 2: Our proposed novel Knowledge Guided Variable Step Forecasting (KG-VSF) task. Our pretraining task estimates a satellite image in the future (orange) using satellite imagery and weather context (yellow) and weather data up to that future date (green).
Figure 3: Knowledge Guided Variable Step Forecasting (KG-VSF) architecture diagram.
Figure 4: Comparison of predictions for 50% missing values across finetuned models from different pretraining tasks.
Figure 5: Image Forecast Downstream Task comparison. Row 1 depicts a crop field; green arrows mark regions of growth and red arrows mark regions of harvest. KG-VSF captures both phenomena better than SM-VSF. Row 2 depicts a case where KG-VSF adds snowfall accurately compared to SM-VSF. Row 3 depicts a case where KG-VSF leaves land cover unchanged because of the terrain, whereas SM-VSF adds false greenness. Please zoom for better viewing.
...and 8 more figures
