Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning

Ivica Dimitrovski, Vlatko Spasev, Ivan Kitanovski

TL;DR

This paper tackles the challenge of adapting vision-language models to remote sensing (RS) scene classification under scarce labeled data. It systematically evaluates four prompt-learning paradigms built on CLIP (CoOp, CoCoOp, MaPLe, and PromptSRC) against zero-shot CLIP and a frozen-feature linear probe across nine diverse RS datasets. The findings show that prompt learning yields consistent gains in few-shot settings, with PromptSRC providing the strongest cross-domain robustness through self-regularization, and MaPLe excelling in cross-modal alignment. The work demonstrates that a lightweight, architecture-agnostic prompting approach can bridge domain gaps between natural-image pretraining and overhead imagery, offering a practical pathway toward scalable, label-efficient Earth observation systems.

Abstract

Remote sensing applications increasingly rely on deep learning for scene classification. However, their performance is often constrained by the scarcity of labeled data and the high cost of annotation across diverse geographic and sensor domains. While recent vision-language models like CLIP have shown promise by aligning visual and textual modalities to learn transferable representations at scale, their direct application to remote sensing remains suboptimal due to significant domain gaps and the need for task-specific semantic adaptation. To address this challenge, we systematically explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We evaluate several representative methods, including Context Optimization, Conditional Context Optimization, Multi-modal Prompt Learning, and Prompting with Self-Regulating Constraints. These approaches reflect complementary design philosophies: from static context optimization to conditional prompts for enhanced generalization, multi-modal prompts for joint vision-language adaptation, and semantically regularized prompts for stable learning without forgetting. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Through extensive experiments on multiple benchmark remote sensing datasets, including cross-dataset generalization tests, we demonstrate that prompt learning consistently outperforms both baselines in few-shot scenarios. Notably, Prompting with Self-Regulating Constraints achieves the most robust cross-domain performance. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery, providing a strong foundation for future research in this field.

Paper Structure

This paper contains 30 sections, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Illustrative examples from the datasets used in our study, showing the dataset source and the associated class label for each image.
  • Figure 2: Illustration of zero-shot classification using CLIP. Class labels are reformulated as natural language prompts and encoded into text embeddings. A satellite image of a tennis court is encoded into an image embedding and compared to all text embeddings using cosine similarity. The correct prediction corresponds to the prompt with the highest similarity, e.g., “a satellite photo of a tennis court”.
  • Figure 3: Comparative illustration of the four prompt-learning paradigms evaluated in this study. In the figure, the snowflake icons denote frozen parameters, while the flame icons indicate learnable components. Deep prompting refers to learnable tokens inserted across multiple layers of the encoder. The symbols T and V represent learnable text and visual prompt embeddings, respectively. Cosine similarity denotes the similarity computation used for fine-tuning, whereas the gray cosine similarity indicates the same computation within the frozen architecture. In the MaPLe diagram, the symbol "e" denotes the coupling function that links the text and vision encoders across several layers. (a) CoOp learns static textual context embeddings appended to class name tokens. (b) CoCoOp introduces instance-conditioned prompts for enhanced generalization. (c) MaPLe jointly learns textual and visual prompts for cross-modal coupling. (d) PromptSRC employs self-regulating constraints to preserve pre-trained alignment while adapting to new tasks.
  • Figure 4: Zero-shot and few-shot classification results across nine remote sensing datasets. Each plot shows performance as a function of the number of shots for four prompt-learning methods (CoOp, CoCoOp, MaPLe, and PromptSRC) compared with zero-shot CLIP (hand-crafted prompts) and the linear probe baseline. Prompt-learning approaches consistently outperform both baselines, with MaPLe and PromptSRC achieving the strongest gains in few-shot scenarios.
  • Figure 5: Average zero-shot and few-shot classification accuracy across the nine benchmark remote sensing datasets. Results compare zero-shot CLIP (hand-crafted prompts), linear probe, and four prompt-learning approaches (CoOp, CoCoOp, MaPLe, and PromptSRC). The plot highlights the consistent improvement achieved by prompt-learning methods, with PromptSRC showing the highest overall accuracy and MaPLe achieving strong cross-modal generalization.
  • ...and 10 more figures
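
The zero-shot procedure described in the Figure 2 caption (prompts built from class names, embeddings compared by cosine similarity, highest similarity wins) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the embeddings are hand-written toy vectors standing in for the outputs of CLIP's image and text encoders, and the function names, the class list beyond "tennis court", and the temperature value are all assumptions made for the example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_predict(image_emb, text_embs, class_names, temperature=0.01):
    """Pick the class whose prompt embedding is most similar to the image.

    image_emb: (d,) image embedding; text_embs: (num_classes, d) prompt embeddings.
    temperature is an assumed value used only to turn similarities into probabilities.
    """
    image_emb = l2_normalize(image_emb)
    text_embs = l2_normalize(text_embs)
    sims = text_embs @ image_emb          # cosine similarity per class
    logits = sims / temperature
    logits = logits - logits.max()        # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    best = int(np.argmax(sims))
    return class_names[best], sims, probs


# Hypothetical class list; each class name is wrapped in a hand-crafted prompt.
class_names = ["tennis court", "airport", "forest"]
prompts = [f"a satellite photo of a {name}" for name in class_names]

# Toy stand-ins for encoder outputs (a real system would encode `prompts`
# with the text encoder and the image with the image encoder).
text_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
image_emb = np.array([0.9, 0.1, 0.2])

label, sims, probs = zero_shot_predict(image_emb, text_embs, class_names)
print(prompts[class_names.index(label)])
```

In this sketch the text embeddings are fixed. The prompt-learning methods compared in the paper instead learn part of what feeds the encoders (for example, context tokens around the class name) while the similarity-based prediction step stays the same.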