Towards Vision-Language Geo-Foundation Model: A Survey

Yue Zhou, Zhihang Zhong, Xue Yang

TL;DR

This survey addresses the need for specialized vision-language models in earth observation by introducing Vision-Language Geo-Foundation Models (VLGFMs) and surveying the data-centric approaches that drive their development. It categorizes VLGFMs into contrastive, conversational, and generative architectures, detailing data pipelines, architectural components, and a 20-capability taxonomy that spans perception and reasoning tasks. The paper synthesizes tasks, datasets, and evaluation metrics across IS, IC, VQA, VG, and related geospatial benchmarks, and highlights current challenges such as limited high-resolution data, high training costs, and LLM hallucinations. It concludes with future directions, including more powerful LLMs, richer benchmarks, training-free strategies, and enhanced interpretability, to improve reliability and real-world impact in remote sensing analysis.

Abstract

Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding. However, most methods rely on training with general image datasets, and the lack of geospatial data leads to poor performance on earth observation. Numerous geospatial image-text pair datasets and VLFMs fine-tuned on them have been proposed recently. These new approaches aim to leverage large-scale, multimodal geospatial data to build versatile intelligent models with diverse geo-perceptive capabilities, which we refer to as Vision-Language Geo-Foundation Models (VLGFMs). This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field. In particular, we introduce the background and motivation behind the rise of VLGFMs, highlighting their unique research significance. Then, we systematically summarize the core technologies employed in VLGFMs, including data construction, model architectures, and applications to various multimodal geospatial tasks. Finally, we conclude with insights, issues, and discussions regarding future research directions. To the best of our knowledge, this is the first comprehensive literature review of VLGFMs. We keep tracking related work at https://github.com/zytx121/Awesome-VLGFM.

Paper Structure

This paper contains 21 sections, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Timeline of representative VLGFMs. Purple entities indicate projects that currently play a significant role in advancing the development of VLGFMs. This field is experiencing rapid growth. For additional resources and daily updates, visit our GitHub page.
  • Figure 2: Architecture of representative VLGFMs: (a) Contrastive, (b) Conversational, (c) Generative.
  • Figure 3: Overview of hierarchical capability levels of VLGFMs from $L_0$ to $L_2$.
  • Figure 4: Examples of VLGFM capabilities.