Towards Knowledge Guided Pretraining Approaches for Multimodal Foundation Models: Applications in Remote Sensing
Praveen Ravirathinam, Ajitesh Parthasarathy, Ankush Khandelwal, Rahul Ghosh, Vipin Kumar
TL;DR
This work introduces KG-VSF, a knowledge-guided pretraining objective that models weather-driven changes in the land surface as a conditional generation task in order to learn causal multimodal embeddings for remote sensing. Using a two-phase pretraining scheme (masked reconstruction followed by forecasting) and an encoder-decoder architecture with forward-only attention, KG-VSF yields embeddings that outperform traditional masked-reconstruction and other VSF baselines across crop mapping, soil moisture estimation and forecasting, and spectral-imagery tasks. The results show improved downstream performance and robust embeddings, and provide evidence that the approach captures causal relationships rather than mere correlations, with implications for scalable, physically consistent geospatial foundation models. The methodology and findings point toward more interpretable and transferable multimodal models in geoscience, and potentially in other domains where causal drivers shape observable responses.
Abstract
Self-supervised learning has emerged as a powerful paradigm for pretraining foundation models on large-scale data. Existing pretraining approaches predominantly rely on masked reconstruction or next-token prediction, and they demonstrate strong performance across various downstream tasks, including geoscience applications. However, these approaches do not fully capture the causal interplay between different geospatial and environmental variables. To address this limitation, we propose Knowledge Guided Variable-Step Forecasting (KG-VSF), a novel pretraining task that models forecasting as a conditional generation task, where driver variables (e.g., weather) inform the prediction of response variables (e.g., satellite imagery). We demonstrate that pretraining in this fashion produces strong embeddings. When finetuned on downstream tasks where capturing this causality matters, such as pixel-wise crop type mapping, soil moisture estimation and forecasting, missing image prediction, and future image forecasting, these embeddings outperform finetuned embeddings from other standard pretraining approaches.
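
To make the idea of forecasting as conditional generation more concrete, the following is a minimal sketch of how one training sample for a variable-step forecasting task might be assembled. It is an illustration under stated assumptions, not the authors' implementation: the function name make_vsf_sample, the dictionary keys, the max_step parameter, and the choice to condition on the weather observed between the input time and the target time are all hypothetical and do not come from the paper.

```python
import random


def make_vsf_sample(weather, images, max_step, rng):
    """Build one (inputs, target) pair for variable-step forecasting.

    weather: list of per-timestep driver vectors (hypothetical layout)
    images:  list of per-timestep response observations, same length
    """
    if len(weather) != len(images):
        raise ValueError("weather and images must cover the same timesteps")
    if max_step < 1 or len(images) < 2:
        raise ValueError("need at least two timesteps and max_step >= 1")

    # Pick a start time that leaves room for at least a one-step forecast.
    t = rng.randrange(0, len(images) - 1)
    # Pick the forecast horizon; it varies from sample to sample.
    k = rng.randint(1, min(max_step, len(images) - 1 - t))

    inputs = {
        "image_t": images[t],                     # response observed at time t
        "weather_path": weather[t + 1:t + k + 1],  # assumed conditioning: drivers over (t, t+k]
        "step": k,                                # horizon given to the model
    }
    target = images[t + k]                        # response the model should generate
    return inputs, target


# Toy data: 10 timesteps of two-dimensional weather and placeholder images.
rng = random.Random(0)
weather = [[float(i), 0.1 * i] for i in range(10)]
images = [f"img_{i}" for i in range(10)]
inputs, target = make_vsf_sample(weather, images, max_step=4, rng=rng)
```

In this sketch the model would receive the current image, the sequence of driver values over the forecast window, and the horizon, and would be trained to generate the image at the end of that window. Sampling the horizon per example is what makes the step variable.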
