CV-Cities: Advancing Cross-View Geo-Localization in Global Cities

Gaoshuang Huang, Yang Zhou, Luying Zhao, Wenjian Gan

TL;DR

A novel CVGL framework is proposed that integrates the vision foundation model DINOv2 with an advanced feature mixer. It introduces the symmetric InfoNCE loss and adds near-neighbor sampling and dynamic similarity sampling strategies, significantly enhancing localization accuracy.

Abstract

Cross-view geo-localization (CVGL), which involves matching and retrieving satellite images to determine the geographic location of a ground image, is crucial in GNSS-constrained scenarios. However, this task faces significant challenges due to substantial viewpoint discrepancies, the complexity of localization scenarios, and the need for global localization. To address these issues, we propose a novel CVGL framework that integrates the vision foundation model DINOv2 with an advanced feature mixer. Our framework introduces the symmetric InfoNCE loss and incorporates near-neighbor sampling and dynamic similarity sampling strategies, significantly enhancing localization accuracy. Experimental results show that our framework surpasses existing methods across multiple public and self-built datasets. To further improve global-scale performance, we have developed CV-Cities, a novel dataset for global CVGL. CV-Cities includes 223,736 ground-satellite image pairs with geolocation data, spanning sixteen cities across six continents and covering a wide range of complex scenarios, providing a challenging benchmark for CVGL. The framework trained with CV-Cities demonstrates high localization accuracy in various test cities, highlighting its strong globalization and generalization capabilities. Our datasets and code are available at https://github.com/GaoShuang98/CVCities.
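
The abstract names a symmetric InfoNCE loss but does not spell it out here. As a rough illustration only, the sketch below shows one common batch-level form of that loss in NumPy: each ground embedding is contrasted against every satellite embedding in the batch, the same is done in the reverse direction, and the two terms are averaged. The function name `symmetric_info_nce`, the temperature of 0.1, and the use of in-batch negatives are assumptions made for this example, not details taken from the paper or its repository.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(ground, satellite, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired embeddings (illustrative).

    ground, satellite: arrays of shape (batch, dim); row i of each array is
    assumed to describe the same location. The temperature is a hypothetical
    default chosen for this sketch.
    """
    # L2-normalise so the dot product is a cosine similarity.
    g = ground / np.linalg.norm(ground, axis=1, keepdims=True)
    s = satellite / np.linalg.norm(satellite, axis=1, keepdims=True)

    # logits[i, j] = similarity of ground image i and satellite image j.
    logits = g @ s.T / temperature
    idx = np.arange(len(g))

    # Ground -> satellite: each row should put its mass on the diagonal.
    loss_g2s = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Satellite -> ground: each column should put its mass on the diagonal.
    loss_s2g = -log_softmax(logits, axis=0)[idx, idx].mean()

    return 0.5 * (loss_g2s + loss_s2g)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ground = rng.normal(size=(8, 16))
    satellite = ground + 0.01 * rng.normal(size=(8, 16))  # well-aligned pairs
    print("aligned pairs:   ", symmetric_info_nce(ground, satellite))
    print("mismatched pairs:", symmetric_info_nce(ground, np.roll(satellite, 1, axis=0)))
```

Under these assumptions, correctly paired embeddings yield a lower loss than mismatched ones, and swapping the two arguments leaves the value unchanged, which is the sense in which the loss is symmetric.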

Paper Structure

This paper contains 34 sections, 9 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: The city and sample point distribution maps of the CV-Cities dataset. (a) The city distribution map of the CV-Cities dataset, with red and green dots representing training and testing cities respectively. (b) Sample point distribution map of eight cities (out of sixteen) in the CV-Cities dataset.
  • Figure 2: Examples of ground and satellite images of different types of scenes. The left image of each scene pair is the ground image, and the right image is the corresponding satellite image.
  • Figure 3: The distribution of images in CV-Cities by scene, year, and month. (a) Scene distribution. (b) Yearly distribution. (c) Monthly distribution.
  • Figure 4: Our framework of CVGL.
  • Figure 5: The structures of DINOv2 and feature transformation.
  • ...and 6 more figures