VG-SSL: Benchmarking Self-supervised Representation Learning Approaches for Visual Geo-localization

Jiuhong Xiao; Gao Zhu; Giuseppe Loianno

TL;DR

This study presents a novel VG-SSL framework, designed for versatile integration and benchmarking of diverse SSL methods for representation learning in VG, featuring a unique geo-related pair strategy, GeoPair.

Abstract

Visual Geo-localization (VG) is a critical research area for identifying geo-locations from visual inputs, particularly in autonomous navigation for robotics and vehicles. Current VG methods often learn feature extractors from geo-labeled images to create dense, geographically relevant representations. Recent advances in Self-Supervised Learning (SSL) have demonstrated its capability to achieve performance on par with supervised techniques using unlabeled images. This study presents a novel VG-SSL framework, designed for versatile integration and benchmarking of diverse SSL methods for representation learning in VG, featuring a unique geo-related pair strategy, GeoPair. Through extensive performance analysis, we adapt SSL techniques to improve VG on datasets from hand-held and car-mounted cameras used in robotics and autonomous vehicles. Our results show that contrastive learning and information maximization methods yield superior geo-specific representation quality, matching or surpassing the performance of state-of-the-art VG techniques. To our knowledge, this is the first benchmarking study of SSL in VG, highlighting its potential in enhancing geo-specific visual representations for robotics and autonomous vehicles. The code is publicly available at https://github.com/arplaboratory/VG-SSL.

Paper Structure (25 sections, 5 equations, 7 figures, 7 tables)

Introduction
Related Works
Methodology
Loss functions
Experimental Setup
Datasets and Metrics
Implementation Details
Results
Comparison with State-of-the-art
Ablation study
Hard Negative Mining
Number of Projection Layers
Projected Embedding Dimensionality
Feature Embeddings Dimensionality
Visualization
...and 10 more sections

Figures (7)

Figure 1: VG-SSL Framework Overview: This framework integrates various Visual Geo-localization (VG) datasets, models, and Self-Supervised Learning (SSL) loss functions for representation learning. It benchmarks VG performance across different SSL strategies trained with the geo-related pair strategy, GeoPair, and offers an in-depth analysis of SSL method settings tailored for geo-specific representation learning.
Figure 2: VG-SSL Architecture: During training, query images $I_q$ and positive database images $I_{k^p}$ are sampled, with optional negative images $I_{k^n}$ selected via HNM. GeoPair strategy builds image pairs using query-positive pairs $I_q$, $I_{k^p}$ and augmented negative pairs $I^{t}_{k^n}$, $I^{t^\prime}_{k^n}$ with augmentation $t, t^\prime \sim T$. The feature extractor $F$ then produces embeddings ($q$, $k^p$, $k_{t}^n$, and $k_{t^\prime}^n$), and SSL loss is applied to train $F$. During inference, the projection head is removed, and KNN is used with feature embeddings $\tilde{q}$ and $\tilde{k}$ from the aggregation module.
Figure 3: The activation maps of ResNet-50 models trained with Triplet Loss (Baseline) and SSL methods. For each dataset, the first row is for the query image and the second row is for the positive sample image.
Figure 4: Visualization of top-5 retrieved candidates for illumination change across different SSL training strategies
Figure 5: Visualization of top-5 retrieved candidates for season change across different SSL training strategies
...and 2 more figures
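
The pairing and retrieval steps described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `build_geopair_batch`, `knn_retrieve`, and `jitter` are hypothetical, the noise-based augmentation is a toy stand-in for the augmentation set $T$, and the use of cosine similarity for KNN is an assumption, since the caption does not name a distance metric.

```python
import numpy as np


def jitter(img, rng):
    """Toy augmentation (assumption): add small Gaussian noise to an image."""
    return img + 0.01 * rng.standard_normal(img.shape)


def build_geopair_batch(queries, positives, negatives, augment, rng):
    """Build two aligned views as the Figure 2 caption describes.

    Query-positive pairs are used directly; each optional negative image
    is paired with itself under two independently sampled augmentations.
    """
    if len(negatives) == 0:
        # Negatives are optional: fall back to query-positive pairs only.
        return queries, positives
    neg_view_a = np.stack([augment(img, rng) for img in negatives])
    neg_view_b = np.stack([augment(img, rng) for img in negatives])
    view_a = np.concatenate([queries, neg_view_a], axis=0)
    view_b = np.concatenate([positives, neg_view_b], axis=0)
    return view_a, view_b


def knn_retrieve(query_emb, db_emb, k=5):
    """Return indices of the k most similar database embeddings per query.

    Cosine similarity is assumed here for illustration.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = db_emb / np.linalg.norm(db_emb, axis=1, keepdims=True)
    sims = q @ d.T
    return np.argsort(-sims, axis=1)[:, :k]
```

In this sketch, row `i` of the first view and row `i` of the second view form one pair, so any pairwise SSL loss can consume the two arrays after they pass through the feature extractor. At inference only `knn_retrieve` is relevant, applied to embeddings taken from the aggregation module rather than the projection head.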
