Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques

Lijie Tao, Haokui Zhang, Haizhao Jing, Yu Liu, Dawei Yan, Guoting Wei, Xizhe Xue

TL;DR

This survey analyzes how visual language models are advancing remote sensing by bridging imagery and text. It reviews foundational models (Transformers, ViT, and VLMs like LLaVA), datasets (manual, combined, automatically annotated), and capabilities (pure visual vs. vision-language tasks). It categorizes recent advances into contrastive and conversational VLM strands, comparing performance across RS benchmarks and highlighting leading systems such as SkySenseGPT, RemoteCLIP, and GeoRSCLIP. The paper also discusses practical challenges, data quality, and future directions, including regression tasks, multispectral and SAR integration, multimodal outputs, and multitemporal analysis, to guide researchers and practitioners in deploying RS VLMs effectively.

Abstract

Recently, the remarkable success of ChatGPT has sparked a renewed wave of interest in artificial intelligence (AI), and advances in visual language models (VLMs) have pushed this enthusiasm to new heights. Unlike previous AI approaches, which generally formulated tasks as discriminative models, VLMs frame tasks as generative models and align language with visual information, enabling them to handle more challenging problems. The remote sensing (RS) field, a highly practical domain, has also embraced this trend and introduced several VLM-based RS methods that have demonstrated promising performance and enormous potential. In this paper, we first review the fundamental theories related to VLMs, then summarize the datasets constructed for VLMs in remote sensing and the various tasks they address. Finally, we categorize the improvement methods into three main parts according to the core components of VLMs and provide a detailed introduction to and comparison of these methods. A project associated with this review has been created at https://github.com/taolijie11111/VLMs-in-RS-review.

Paper Structure

This paper contains 19 sections, 1 equation, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Number of publications for VLMs in RS (from Web of Science).
  • Figure 2: Organization of this survey.
  • Figure 3: An illustration of the Transformer.
  • Figure 4: An illustration of the Vision Transformer (ViT).
  • Figure 5: An illustration of LLaVA.
  • ...and 5 more figures