
Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification

Karim El Khoury, Maxime Zanella, Benoît Gérin, Tiffanie Godelaine, Benoît Macq, Saïd Mahmoudi, Christophe De Vleeschouwer, Ismail Ben Ayed

TL;DR

This work targets zero-shot remote sensing scene classification by addressing the limitations of inductive patch-wise inference in Vision-Language Models. It introduces RS-TransCLIP, a transductive approach operating entirely in the embedding space that combines a Gaussian Mixture Model, affinity-based Laplacian regularization, and KL divergence from initial text-derived pseudo-labels to refine predictions without supervision. The method yields consistent, significant accuracy gains across 10 RS datasets and multiple state-of-the-art VLMs, with only negligible additional computational cost. The authors provide open-source code and demonstrate a scalable framework for leveraging cross-modal structure to improve zero-shot RS classification, with potential extensions to prompts and few-shot, human-in-the-loop setups.

Abstract

Vision-Language Models for remote sensing have shown promising uses thanks to their extensive pretraining. However, their conventional usage in zero-shot scene classification methods still involves dividing large images into patches and making independent predictions, i.e., inductive inference, thereby limiting their effectiveness by ignoring valuable contextual information. Our approach tackles this issue by utilizing initial predictions based on text prompting and patch affinity relationships from the image encoder to enhance zero-shot capabilities through transductive inference, all without the need for supervision and at a minor computational cost. Experiments on 10 remote sensing datasets with state-of-the-art Vision-Language Models demonstrate significant accuracy improvements over inductive zero-shot classification. Our source code is publicly available on GitHub: https://github.com/elkhouryk/RS-TransCLIP
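
To make the inductive/transductive distinction concrete, the following is a minimal sketch, not the authors' implementation. It assumes patch and class-prompt embeddings are already available as plain lists of floats. The inductive step scores each patch on its own; the refinement step is a deliberately simplified stand-in that keeps only two of the ingredients named above (a patch affinity term and a KL-style anchor to the text-derived pseudo-labels) and omits the Gaussian Mixture Model. The function names, the temperature of 100, the k-nearest-neighbour affinity, and the iteration count are all illustrative assumptions.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def inductive_zero_shot(image_embs, text_embs, temperature=100.0):
    """Score every patch independently against each class text embedding."""
    texts = [l2_normalize(t) for t in text_embs]
    probs = []
    for emb in image_embs:
        f = l2_normalize(emb)
        probs.append(softmax([temperature * dot(f, t) for t in texts]))
    return probs

def knn_affinity(image_embs, k=3):
    """Symmetric patch-to-patch affinity: cosine similarity to the k nearest patches."""
    feats = [l2_normalize(e) for e in image_embs]
    n = len(feats)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sims = [(dot(feats[i], feats[j]), j) for j in range(n) if j != i]
        for s, j in sorted(sims, reverse=True)[:k]:
            w[i][j] = w[j][i] = max(s, 0.0)
    return w

def transductive_refine(y_hat, w, n_iters=10, eps=1e-12):
    """Simplified joint refinement (assumption, GMM term omitted).

    Each update sets z_i proportional to y_hat_i * exp(sum_j w_ij * z_j):
    neighbours pull a patch toward their labels, while the log(y_hat) term
    keeps it anchored to its initial text-derived pseudo-label.
    """
    n, c = len(y_hat), len(y_hat[0])
    z = [row[:] for row in y_hat]
    for _ in range(n_iters):
        z = [
            softmax([
                math.log(y_hat[i][k] + eps)
                + sum(w[i][j] * z[j][k] for j in range(n))
                for k in range(c)
            ])
            for i in range(n)
        ]
    return z

# Toy usage with hypothetical 2-D embeddings (5 patches, 2 classes).
patches = [[0.9, 0.4], [0.85, 0.5], [0.69, 0.72], [0.3, 0.95], [0.2, 0.98]]
prompts = [[1.0, 0.0], [0.0, 1.0]]
initial = inductive_zero_shot(patches, prompts)
refined = transductive_refine(initial, knn_affinity(patches, k=2))
for before, after in zip(initial, refined):
    print([round(p, 3) for p in before], [round(p, 3) for p in after])
```

With an all-zero affinity matrix the refinement leaves the inductive predictions unchanged, which is one way to see that any change comes from the relationships between patches rather than from additional labels.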

Paper Structure

This paper contains 14 sections, 7 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: Top-1 accuracy of RS-TransCLIP on ViT-L/14 RS VLMs for zero-shot scene classification across 10 datasets.
  • Figure 2: (a) VLMs assign each image to its closest text embedding and (b) RS-TransCLIP exploits the image-text structure to enhance the predictions without any additional labels.