A Genealogy of Foundation Models in Remote Sensing
Kevin Lane, Morteza Karimzadeh
TL;DR
The paper addresses the need for robust, scalable foundation representations in remote sensing (RS) by surveying self-supervised learning (SSL) approaches and tracing their roots to computer vision. It categorizes RS SSL methods into negative sampling, distillation, redundancy reduction, and masked image modeling, and then analyzes RS-specific adaptations across single- and multi-sensor data, including temporal, geolocation, and text modalities. Key contributions include a structured review of RS-specific SSL adaptations, a synthesis of multi-sensor and multi-modal strategies, and a roadmap for cost-efficient, climate-aware RS foundation models. The work highlights opportunities for leveraging unlabeled, seasonal, and diverse sensor data to build RS representations that generalize across tasks and epochs, with practical implications for scalable, robust Earth observation analytics.
Abstract
Foundation models have garnered increasing attention for representation learning in remote sensing. Many such foundation models adopt approaches that have demonstrated success in computer vision with minimal domain-specific modification. However, the development and application of foundation models in this field are still burgeoning, as there are a variety of competing approaches for how to most effectively leverage remotely sensed data. This paper examines these approaches, along with their roots in the computer vision field, to characterize potential advantages and pitfalls and to outline future directions for improving remote sensing-specific foundation models. We discuss the quality of the learned representations and methods to alleviate the need for massive compute resources. We first examine single-sensor remote sensing foundation models to introduce concepts and provide context, and then place emphasis on incorporating the multi-sensor aspect of Earth observations into foundation models. In particular, we explore the extent to which existing approaches leverage multiple sensors in training foundation models in relation to multi-modal foundation models. Finally, we identify opportunities for further harnessing the vast amounts of unlabeled, seasonal, and multi-sensor remote sensing observations.
