FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding

Zhenshi Li, Weikang Yu, Dilxat Muhtar, Xueliang Zhang, Pengfeng Xiao, Pedram Ghamisi, Xiao Xiang Zhu

TL;DR

FarSLIP addresses the limits that CLIP's global alignment places on fine-grained remote sensing (RS) understanding. It does so by building a multi-granularity RS image-text dataset (MGRS-200k) and analyzing where existing region-text alignment approaches fall short. The analysis shows that preserving CLIP's CLS-based region-language coupling while adopting patch-to-patch local-global distillation yields better fine-grained RS understanding. The resulting two-stage FarSLIP framework achieves state-of-the-art performance on open-vocabulary semantic segmentation, zero-shot classification, and cross-modal retrieval, which the authors attribute to its use of region-category supervision and its local-global alignment. The work offers practical guidance for constructing data for RS vision-language foundation models (VLFMs) and for fine-grained CLIP adaptation; the dataset, code, and models are released.

Abstract

As CLIP's global alignment limits its ability to capture fine-grained details, recent efforts have focused on enhancing its region-text alignment. However, current remote sensing (RS)-specific CLIP variants still inherit this limited spatial awareness. We identify two key limitations behind this: (1) current RS image-text datasets generate global captions from object-level labels, leaving the original object-level supervision underutilized; (2) despite the success of region-text alignment methods in the general domain, their direct application to RS data often leads to performance degradation. To address these limitations, we construct the first multi-granularity RS image-text dataset, MGRS-200k, featuring rich object-level textual supervision for RS region-category alignment. We further investigate existing fine-grained CLIP tuning strategies and find that current explicit region-text alignment methods, whether direct or indirect, underperform due to severe degradation of CLIP's semantic coherence. Building on these findings, we propose FarSLIP, a Fine-grained Aligned RS Language-Image Pretraining framework. Rather than the commonly used patch-to-CLS self-distillation, FarSLIP employs patch-to-patch distillation to align local and global visual cues, which improves feature discriminability while preserving semantic coherence. Additionally, to effectively utilize region-text supervision, it employs simple CLS token-based region-category alignment rather than explicit patch-level alignment, further enhancing spatial awareness. FarSLIP features improved fine-grained vision-language alignment in the RS domain and sets a new state of the art not only on RS open-vocabulary semantic segmentation, but also on image-level tasks such as zero-shot classification and image-text retrieval. Our dataset, code, and models are available at https://github.com/NJU-LHRS/FarSLIP.
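
To make the two training signals named in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that patch-to-patch distillation can be approximated by a mean cosine distance between spatially matched patch embeddings, and that CLS token-based region-category alignment can be approximated by a softmax cross-entropy between region CLS embeddings and category text embeddings. The function names, the input shapes, the cosine-distance form, and the temperature value 0.07 are all illustrative assumptions; this excerpt does not give the paper's loss definitions.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each row vector to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def patch_to_patch_distill_loss(student_patches: np.ndarray,
                                teacher_patches: np.ndarray) -> float:
    """Mean cosine distance between spatially matched patch embeddings.

    student_patches: (N, D) patch features of a local crop (hypothetical input).
    teacher_patches: (N, D) patch features of the same region, taken from the
        full-image feature map (hypothetical input).
    """
    s = l2_normalize(student_patches)
    t = l2_normalize(teacher_patches)
    cos = np.sum(s * t, axis=-1)  # (N,) per-patch cosine similarity
    return float(np.mean(1.0 - cos))


def region_category_loss(region_cls: np.ndarray,
                         category_text: np.ndarray,
                         labels: np.ndarray,
                         temperature: float = 0.07) -> float:
    """Cross-entropy between region CLS embeddings and category text embeddings.

    region_cls: (R, D) CLS embeddings of cropped regions (hypothetical input).
    category_text: (C, D) text embeddings, one per category name.
    labels: (R,) integer category index of each region.
    temperature: assumed softmax temperature, not taken from the paper.
    """
    logits = l2_normalize(region_cls) @ l2_normalize(category_text).T / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))
```

In this sketch the distillation term compares each local patch only with its counterpart in the global feature map, so no patch is pulled toward a single pooled CLS vector; the region-category term only ever touches the CLS embedding of a cropped region.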

Paper Structure

This paper contains 32 sections, 11 equations, 6 figures, 18 tables.

Figures (6)

  • Figure 1: Accuracy differences relative to CLIP for models fine-tuned with SOTA methods, e.g., FineCLIP [jing2024fineclip] and CLIPSelf [wu2024clipself], on various RS datasets (shown in brackets). These models show only marginal improvement or even performance drops in most cases, indicating limited effectiveness for RS-domain application.
  • Figure 2: Overview of the analysis procedure. The item marked with a checkmark indicates the most effective method identified for RS-specific CLIP tuning.
  • Figure 3: Architecture of FarSLIP. Stage one is trained on image-caption data with $\mathcal{L}_{\mathrm{glo}}$ and $\mathcal{L}_{\mathrm{dis}}$, while stage two leverages image-caption and object-category pairs with $\mathcal{L}_{\mathrm{glo}}$ and $\mathcal{L}_{\mathrm{loc}}$.
  • Figure 4: Four examples from our proposed MGRS-200k dataset.
  • Figure 5: Cosine similarity between the anchor pixel (marked with ×) and all other pixels. Columns 2–4 correspond to the baseline (OpenAI CLIP), RoI embedding-based region-category alignment, and CLS token-based region-category alignment, respectively. RoI-based training tends to disrupt pixel-level semantic coherence, while CLS token-based training enhances it. Columns 5–7 correspond to the baseline (without self-distillation), RoI-to-CLS self-distillation, and RoI-to-Pooled self-distillation, respectively. RoI-to-CLS self-distillation compromises semantic coherence, whereas RoI-to-Pooled self-distillation effectively preserves it.
  • ...and 1 more figure
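
The anchor-pixel similarity maps described in the Figure 5 caption can be approximated with a few lines of NumPy. The sketch below is illustrative only: the function name, the (H, W, D) feature layout, and the idea that the dense features come from upsampled patch embeddings are assumptions, not details taken from the paper.

```python
import numpy as np


def anchor_similarity_map(features: np.ndarray, anchor: tuple[int, int]) -> np.ndarray:
    """Cosine similarity between one anchor location and every location.

    features: (H, W, D) dense feature map (hypothetical input, e.g. patch
        embeddings upsampled to pixel resolution).
    anchor: (row, col) index of the anchor pixel.
    Returns an (H, W) array with values in [-1, 1].
    """
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    unit = features / np.maximum(norms, 1e-8)  # guard against zero vectors
    # Dot product of every unit feature with the anchor's unit feature.
    return unit @ unit[anchor[0], anchor[1]]
```

A map in which high values cluster on pixels of the same object as the anchor would correspond to what the caption calls preserved semantic coherence; a noisy or scattered map would correspond to disrupted coherence.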