ProGEO: Generating Prompts through Image-Text Contrastive Learning for Visual Geo-localization

Chen Mao; Jingqi Hu

TL;DR

This work uses the multi-modal description capability of CLIP (Contrastive Language-Image Pretraining) to learn a set of text prompts for each geographic image feature. The prompts serve as vague descriptions that help the image encoder learn better and more generalizable visual features.

Abstract

Visual Geo-localization (VG) is the task of identifying the location depicted in a query image. It is widely applied in robotics and computer vision, for example in autonomous driving, the metaverse, augmented reality, and SLAM. For fine-grained images that lack specific text descriptions, purely visual methods for representing neighborhood features often cause the model to focus on overly fine-grained details and fail to fully mine the semantic information in the images. We therefore propose a two-stage training method that enhances visual performance and uses contrastive learning to mine challenging samples. In the first stage, we leverage the multi-modal description capability of CLIP (Contrastive Language-Image Pretraining) to create a set of learnable text prompts for each geographic image feature, which together form vague descriptions. In the second stage, these dynamic text prompts assist the training of the image encoder, enabling it to learn better and more generalizable visual features. Applying text to a purely visual task in this way addresses a key obstacle to using multi-modal models on geographic images: such images usually lack precise descriptions, which makes multi-modal models difficult to apply widely. We validate the proposed strategy on several large-scale visual geo-localization datasets, where it achieves competitive results. Our code and model are available at https://github.com/Chain-Mao/ProGEO.
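
As a rough illustration of the image-text contrastive objective described in the abstract, the following is a minimal sketch of a CLIP-style symmetric contrastive loss in plain Python. The function names, the temperature value of 0.07, and the toy feature vectors are assumptions made for illustration; they are not taken from the ProGEO code, and the loss actually used in the paper may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_cross_entropy(logits, targets):
    """Mean cross-entropy over rows, using a numerically stable log-sum-exp."""
    total = 0.0
    for row, target in zip(logits, targets):
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[target]
    return total / len(logits)


def symmetric_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """CLIP-style loss: the i-th image and the i-th text form the positive pair.

    temperature=0.07 is an illustrative default, not a value from the paper.
    """
    images = [l2_normalize(v) for v in image_feats]
    texts = [l2_normalize(v) for v in text_feats]
    # Image-to-text similarity matrix, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Transpose to obtain the text-to-image similarity matrix.
    logits_t = [list(col) for col in zip(*logits)]
    targets = list(range(len(images)))
    return 0.5 * (
        mean_cross_entropy(logits, targets) + mean_cross_entropy(logits_t, targets)
    )


# Toy features (hypothetical): aligned pairs should score a lower loss than
# deliberately mismatched pairs.
aligned_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                             [[0.0, 1.0], [1.0, 0.0]])
```

In a setup like the one the abstract outlines, the text features would come from a text encoder fed with learnable prompt tokens, and the loss would be minimized first with respect to the prompts and then with respect to the image encoder. That wiring is omitted here.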

Paper Structure (20 sections, 8 equations, 5 figures, 3 tables)

Introduction
Related Works
Visual Geo-localization
Visual Language Model
Method
The First Training Stage
The Second Training Stage
Experiments
Datasets and Evaluation Metrics
Datasets
Evaluation Metrics
Implementation Details
Image Backbone
Training Details
Comparison with State-of-the-Art Methods
...and 5 more sections

Figures (5)

Figure 1: Results of models using ResNet-50 as the image encoder on two challenging datasets, Pitts30k and St Lucia, showing the top five database images matched to each query image.
Figure 2: The overall architecture of our model with a ViT image encoder backbone.
Figure 3: The first training stage for model ProGEO.
Figure 4: The second training stage for model ProGEO.
Figure 5: Ablation on the number of frozen layers for the ViT-B/32 image encoder.
