Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models

Karim El Khoury, Maxime Zanella, Christophe De Vleeschouwer, Benoit Macq

TL;DR

This work examines how well remote sensing vision-language models (RSVLMs) generalize in low-data regimes by introducing the first structured few-shot adaptation benchmark for RSVLMs. It evaluates three RSVLMs and five adaptation methods on ten RS scene classification datasets, and provides a reproducible framework with open-source code. Key findings show that zero-shot performance does not reliably predict few-shot results: GeoRSCLIP delivers the strongest few-shot generalization across datasets, while no single adaptation method dominates across all settings. The study also analyzes computational costs and backbone scaling, offering guidance for selecting methods under practical constraints and outlining directions for RS-specific few-shot strategies.

Abstract

Remote Sensing Vision-Language Models (RSVLMs) have shown remarkable potential thanks to large-scale pretraining, achieving strong zero-shot performance on various tasks. However, their ability to generalize in low-data regimes, such as few-shot learning, remains insufficiently explored. In this work, we present the first structured benchmark for evaluating few-shot adaptation methods on RSVLMs. We conduct comprehensive experiments across ten remote sensing scene classification datasets, applying five widely used few-shot adaptation strategies to three state-of-the-art RSVLMs with varying backbones. Our findings reveal that models with similar zero-shot performance can exhibit markedly different behavior under few-shot adaptation, with some RSVLMs being inherently more amenable to such adaptation than others. The variability in performance and the absence of a clear winner among existing methods highlight the need for more robust few-shot adaptation methods tailored to RS. To facilitate future research, we provide a reproducible benchmarking framework and open-source code to systematically evaluate RSVLMs under few-shot conditions. The source code is publicly available on GitHub: https://github.com/elkhouryk/fewshot_RSVLMs

Paper Structure

This paper contains 12 sections, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Performance evaluation of four vision-language models (GeoRSCLIP, RemoteCLIP, SkyCLIP, and CLIP) using five different few-shot adaptation methods. Results are computed with a ViT-B/32 backbone and represent the average performance across the ten benchmark datasets over three random seeds.
  • Figure 2: Comparison of training time for five few-shot adaptation methods on an NVIDIA A100 80GB GPU, tested on GeoRSCLIP with a ViT-B/32 backbone using 4-shot samples from the MLRSNet dataset (46 classes) and original hyperparameters from their respective publications.
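
The few-shot setting referenced in the captions above (for example, 4-shot samples from a 46-class dataset) can be illustrated with a minimal sketch of per-class support sampling. This is not the authors' code: the function name `sample_k_shot`, the toy label list, and the seed value are hypothetical and chosen only for illustration. It assumes the common convention that a K-shot support set contains K randomly drawn training images per class, with the draw controlled by a random seed.

```python
import random
from collections import defaultdict


def sample_k_shot(labels, k, seed):
    """Return sorted indices of k randomly chosen examples per class.

    Hypothetical helper for illustration; not taken from the benchmark code.
    """
    rng = random.Random(seed)  # local generator so the draw is reproducible

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) < k:
            raise ValueError(f"class {label} has fewer than {k} examples")
        chosen.extend(rng.sample(indices, k))  # k distinct indices per class
    return sorted(chosen)


# Toy labels: 46 classes with 10 images each (placeholder, not real MLRSNet data).
toy_labels = [c for c in range(46) for _ in range(10)]

support = sample_k_shot(toy_labels, k=4, seed=1)
print(len(support))  # 4 shots x 46 classes = 184 support images
```

Under this convention, repeating the draw with several seeds and averaging the resulting accuracies gives the kind of seed-averaged figure reported in the Figure 1 caption.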