GeoAlignCLIP: Enhancing Fine-Grained Vision-Language Alignment in Remote Sensing via Multi-Granular Consistency Learning

Xiao Yang; Ronghao Fu; Zhuoran Duan; Zhiwen Lin; Xueyan Liu; Bo Yang

TL;DR

GeoAlignCLIP is a unified framework for fine-grained vision-language alignment in remote sensing. It learns multi-granular semantic alignments and enforces intra-modal consistency, enabling more precise alignment between image regions and text concepts.

Abstract

Vision-language pretraining models have made significant progress in bridging remote sensing imagery with natural language. However, existing approaches often fail to effectively integrate multi-granular visual and textual information, relying primarily on global image-text alignment. This limitation hinders the model's ability to accurately capture fine-grained details in images, thus restricting its performance in complex, fine-grained tasks. To address this, we propose GeoAlignCLIP, a unified framework that achieves fine-grained alignment in remote sensing tasks by learning multi-granular semantic alignments and incorporating intra-modal consistency, enabling more precise visual-semantic alignment between image regions and text concepts. Additionally, we construct RSFG-100k, a fine-granular remote sensing dataset containing scene descriptions, region-level annotations, and challenging hard-negative samples, providing hierarchical supervision for model training. Extensive experiments conducted on multiple public remote-sensing benchmarks demonstrate that GeoAlignCLIP consistently outperforms existing RS-specific methods across diverse tasks, exhibiting more robust and accurate fine-grained vision-language alignment.

Paper Structure

This paper contains 15 sections, 5 equations, 5 figures, and 7 tables.

Introduction
Related Works
Methodology
Global Contrastive Learning
Multi-Granularity Contrastive Learning
Multi-View Consistency Learning
Overall Objective
RSFG-100k Dataset Construction
Experiments
Implementation Details
Comparisons with state-of-the-art Methods
Ablation Study
Parameter Size and Efficiency
Visualization Study
Conclusion

Figures (5)

Figure 1: (a) Comparison of feature-map RoI cropping and pixel-space RoI cropping methods for vision-language alignment. (b) Comparison of attention heatmaps for models trained with different text granularities (brief, enriched, and multi-granular) in capturing global and local semantic relationships.
Figure 2: Overall architecture and training pipeline of GeoAlignCLIP. (a) Stage I performs global contrastive learning. (b) Stage II conducts Multi-Granularity Contrastive Learning and Multi-View Consistency Learning.
Figure 3: Comparison of open-vocabulary object detection results on DIOR and DOTAv1.0. mAP$_{\text{n}}$ and mAP$_{\text{b}}$ denote mean average precision on novel and base classes, respectively.
Figure 4: Visualization results of open-vocabulary object detection.
Figure 5: Visualization of fine-grained vision-language alignment by GeoAlignCLIP.
