GeoAlignCLIP: Enhancing Fine-Grained Vision-Language Alignment in Remote Sensing via Multi-Granular Consistency Learning

Xiao Yang, Ronghao Fu, Zhuoran Duan, Zhiwen Lin, Xueyan Liu, Bo Yang

TL;DR

GeoAlignCLIP is a unified framework for fine-grained vision-language alignment in remote sensing. It learns multi-granular semantic alignments and incorporates intra-modal consistency, enabling more precise visual-semantic alignment between image regions and text concepts.

Abstract

Vision-language pretraining models have made significant progress in bridging remote sensing imagery with natural language. However, existing approaches often fail to effectively integrate multi-granular visual and textual information, relying primarily on global image-text alignment. This limitation hinders the model's ability to accurately capture fine-grained details in images, restricting its performance in complex, fine-grained tasks. To address this, we propose GeoAlignCLIP, a unified framework that achieves fine-grained alignment in remote sensing tasks by learning multi-granular semantic alignments and incorporating intra-modal consistency, enabling more precise visual-semantic alignment between image regions and text concepts. Additionally, we construct RSFG-100k, a fine-grained remote sensing dataset containing scene descriptions, region-level annotations, and challenging hard-negative samples, providing hierarchical supervision for model training. Extensive experiments on multiple public remote sensing benchmarks demonstrate that GeoAlignCLIP consistently outperforms existing RS-specific methods across diverse tasks, exhibiting more robust and accurate fine-grained vision-language alignment.
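
For context, the global image-text alignment that the abstract says existing approaches rely on is commonly trained with a CLIP-style symmetric contrastive loss. The sketch below is a minimal illustration of that general objective in plain Python and is not GeoAlignCLIP's implementation: the function names, the toy embeddings, and the temperature of 0.07 are all assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def global_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss over a batch.

    Assumes image_embs[i] and text_embs[i] form a matched pair; every other
    pairing in the batch is treated as a negative. The temperature value is
    an illustrative assumption, not taken from the paper.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: cosine similarity of image i and caption j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy usage: matched pairs give a low loss, mismatched pairs a high one.
aligned = global_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = global_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Because this loss compares one embedding per image with one embedding per caption, it carries no region-level signal, which is the gap the abstract says multi-granular alignment is meant to address.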

Paper Structure

This paper contains 15 sections, 5 equations, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: (a) Comparison of feature-map RoI cropping and pixel-space RoI cropping methods for vision-language alignment. (b) Comparison of attention heatmaps for models trained with different text granularities (brief, enriched, and multi-granular) in capturing global and local semantic relationships.
  • Figure 2: Overall architecture and training pipeline of GeoAlignCLIP. (a) Stage I performs global contrastive learning. (b) Stage II conducts Multi-Granularity Contrastive Learning and Multi-View Consistency Learning.
  • Figure 3: Comparison of open-vocabulary object detection results on DIOR and DOTAv1.0. mAP$_{\text{n}}$ and mAP$_{\text{b}}$ denote mean average precision on novel and base classes, respectively.
  • Figure 4: Visualization results of open-vocabulary object detection.
  • Figure 5: Visualization of fine-grained vision-language alignment by GeoAlignCLIP.