Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization

Guopeng Li, Ming Qian, Gui-Song Xia

TL;DR

The paper proposes an unsupervised framework with two components: a cross-view projection that guides the model in retrieving initial pseudo-labels, and a fast re-ranking mechanism that refines those pseudo-labels by exploiting the fact that “the perfectly paired ground-satellite image is located in a unique and identical scene”.

Abstract

This paper investigates the effective utilization of unlabeled data for large-area cross-view geo-localization (CVGL), encompassing both unsupervised and semi-supervised settings. Common approaches to CVGL rely on ground-satellite image pairs and employ label-driven supervised training. However, the cost of collecting precise cross-view image pairs hinders the deployment of CVGL in real-life scenarios. Without such pairs, it becomes more challenging for CVGL to handle the significant imaging and spatial gaps between ground and satellite images. To this end, we propose an unsupervised framework that includes a cross-view projection to guide the model in retrieving initial pseudo-labels and a fast re-ranking mechanism to refine the pseudo-labels by leveraging the fact that “the perfectly paired ground-satellite image is located in a unique and identical scene”. The framework exhibits competitive performance compared with supervised methods on three open-source benchmarks. Our code and models will be released at https://github.com/liguopeng0923/UCVGL.
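
The pseudo-label idea in the abstract can be illustrated with a small sketch. This summary does not specify the paper's retrieval or re-ranking procedure, so the code below swaps in a simple mutual-nearest-neighbour filter as a stand-in for the "unique and identical scene" constraint: a ground-satellite pair is kept only if each image is the other's top-1 match. The function names (`cosine`, `mutual_nn_pseudo_labels`) and the toy embeddings are hypothetical and are not taken from the authors' code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mutual_nn_pseudo_labels(ground, satellite):
    """Return (ground_idx, satellite_idx) pairs that are each other's top-1 match.

    Assumption: this mutual-nearest-neighbour filter is only a stand-in for
    the paper's re-ranking; it keeps a pair when the match is unique in both
    retrieval directions.
    """
    sim = [[cosine(g, s) for s in satellite] for g in ground]
    # Best satellite image for each ground image.
    g2s = [max(range(len(satellite)), key=lambda j: row[j]) for row in sim]
    # Best ground image for each satellite image.
    s2g = [max(range(len(ground)), key=lambda i: sim[i][j])
           for j in range(len(satellite))]
    return [(i, j) for i, j in enumerate(g2s) if s2g[j] == i]

# Toy 2-D embeddings (hypothetical data, not real features).
ground_feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
satellite_feats = [[1.0, 0.05], [0.1, 1.0], [-1.0, 0.0]]
pairs = mutual_nn_pseudo_labels(ground_feats, satellite_feats)
print(pairs)  # [(0, 0), (1, 1)]
```

In this toy case the third ground image also retrieves satellite image 0, but that satellite image prefers ground image 0, so the ambiguous pair is discarded rather than used as a pseudo-label.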

Paper Structure (19 sections, 2 equations, 6 figures, 5 tables)


Figures (6)

  • Figure 1: Task Settings. (A): Unsupervised image retrieval (UIR) [hu2022feature] uses object-level images and forms class-level clusters based on class semantics. (B): Unsupervised CVGL (UCVGL) uses scene-level images and performs cross-view alignment based on spatial correspondences (i.e., green lines). Unlike supervised CVGL, UCVGL has no access to GPS labels or paired annotations (i.e., correspondences between ground and satellite images). (C): Common image retrieval tasks are class-level (e.g., UReID [he2020fastreid] and UIR [hu2022feature]), whereas CVGL [Sample4Geo] aims to align cross-view images of the same scene, which is a fine-grained, instance-level task. In CVGL, which has no class semantics, the Top-k most similar images are more discriminative.
  • Figure 2: Pipeline Overview. First, we train two separate encoders on ground panoramas and projected images to initialize a cross-view feature space, which addresses the cold-start problem. Second, we train on ground-satellite image pairs sampled from adaptive pseudo-labels.
  • Figure 3: Projections. Left: we geometrically project ground panoramas to bird's-eye-view (BEV) images. Right: we transform the BEV images into fake images that resemble satellite images.
  • Figure 4: Self-supervised contrastive learning. We learn intra-view discriminative features by attracting two self-augmented images, and cross-view alignment by attracting ground images and their projected fake images.
  • Figure 5: Semi-supervised curriculum learning. Ground images and satellite images are attracted and supervised through the guidance of adaptive pseudo-labels.
  • ...and 1 more figures
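
The "attracting" objective described for Figure 4 can be sketched with a standard InfoNCE-style contrastive loss. This summary does not give the paper's exact loss, so InfoNCE is an assumption here, as are the function names (`normalize`, `info_nce`), the temperature value, and the toy embeddings. The sketch treats `anchors[i]` (e.g., a ground-image embedding) and `positives[i]` (e.g., the embedding of its projected fake image) as a positive pair, with all other entries in the batch as negatives.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] is pulled towards positives[i] and pushed away from
    positives[j] for j != i. The temperature is an assumed hyperparameter.
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # -log softmax of the positive
    return total / len(a)

# Toy embeddings (hypothetical): row i of each list describes the same scene.
ground = [[1.0, 0.0], [0.0, 1.0]]
fake_satellite = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce(ground, fake_satellite)
shuffled_loss = info_nce(ground, fake_satellite[::-1])
print(aligned_loss < shuffled_loss)  # True
```

The loss is small when each ground embedding is closest to its own projected counterpart and grows when the pairing is shuffled, which is the behaviour the caption describes as attracting matched views.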