A Survey on Remote Sensing Foundation Models: From Vision to Multimodality
Ziyue Huang, Hongxi Yan, Qiqi Zhan, Shuai Yang, Mingming Zhang, Chenkai Zhang, YiMing Lei, Zeming Liu, Qingjie Liu, Yunhong Wang
TL;DR
The paper surveys remote sensing (RS) foundation models, detailing architectures (vision-only and multimodal), training paradigms (contrastive, generative, and hybrid), data sources (custom and public RS datasets), and evaluation across a broad task spectrum. It introduces a two-dimensional taxonomy (model architecture and primary functionality) to organize progress and highlight how RS data characteristics shape model design, training, and deployment. Key contributions include a comprehensive taxonomy, a resource repository, and a synthesis of benchmarks and results that exposes the strengths and limitations of current RS foundation models. The work underscores practical implications for scalable, cross-modal geospatial understanding and points to directions such as multimodal fusion, geographic knowledge integration, and efficient architectures to increase real-world impact. Overall, the survey clarifies the state of the field and provides a roadmap for advancing robust, generalizable RS foundation models in diverse environments.
Abstract
The rapid advancement of remote sensing foundation models, particularly vision and multimodal models, has significantly enhanced the capabilities of intelligent geospatial data interpretation. These models combine various data modalities, such as optical, radar, and LiDAR imagery, with textual and geographic information, enabling more comprehensive analysis and understanding of remote sensing data. The integration of multiple modalities allows for improved performance in tasks like object detection, land cover classification, and change detection, which are often challenged by the complex and heterogeneous nature of remote sensing data. However, despite these advancements, several challenges remain. The diversity in data types, the need for large-scale annotated datasets, and the complexity of multimodal fusion techniques pose significant obstacles to the effective deployment of these models. Moreover, training and fine-tuning multimodal models demands significant computational resources, further complicating their practical application in remote sensing image interpretation tasks. This paper provides a comprehensive review of the state of the art in vision and multimodal foundation models for remote sensing, focusing on their architectures, training methods, datasets, and application scenarios. We discuss the key challenges these models face, such as data alignment, cross-modal transfer learning, and scalability, while also identifying emerging research directions aimed at overcoming these limitations. Our goal is to provide a clear understanding of the current landscape of remote sensing foundation models and inspire future research that can push the boundaries of what these models can achieve in real-world applications. The list of resources collected by the paper is available at https://github.com/IRIP-BUAA/A-Review-for-remote-sensing-vision-language-models.
