Adapting Fine-Grained Cross-View Localization to Areas without Fine Ground Truth

Zimin Xia, Yujiao Shi, Hongdong Li, Julian F. P. Kooij

TL;DR

A weakly supervised learning approach based on knowledge self-distillation that consistently and considerably boosts the localization accuracy in the target area and is validated using two recent state-of-the-art models on two benchmarks.

Abstract

Given a ground-level query image and a geo-referenced aerial image that covers the query's local surroundings, fine-grained cross-view localization aims to estimate the location of the ground camera inside the aerial image. Recent works have focused on developing advanced networks trained with accurate ground truth (GT) locations of ground images. However, the trained models always suffer a performance drop when applied to images in a new target area that differs from the training area. In most deployment scenarios, acquiring fine GT, i.e., accurate GT locations, for target-area images to re-train the network can be expensive and sometimes infeasible. In contrast, collecting images with noisy GT, with errors of tens of meters, is often easy. Motivated by this, our paper focuses on improving the performance of a trained model in a new target area by leveraging only the target-area images without fine GT. We propose a weakly supervised learning approach based on knowledge self-distillation. This approach uses predictions from a pre-trained model as pseudo GT to supervise a copy of itself. Our approach includes a mode-based pseudo GT generation for reducing uncertainty in pseudo GT and an outlier filtering method to remove unreliable pseudo GT. Our approach is validated using two recent state-of-the-art models on two benchmarks. The results demonstrate that it consistently and considerably boosts the localization accuracy in the target area.
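
As a rough illustration of the mode-based pseudo GT idea mentioned in the abstract, the sketch below collapses a teacher model's location heat map to a one-hot map at its highest-probability cell. This is only one plausible reading of "mode-based"; the function name `mode_pseudo_gt`, the 2D NumPy heat-map representation, and the one-hot target format are assumptions made for illustration and are not taken from the paper.

```python
from typing import Tuple

import numpy as np


def mode_pseudo_gt(heatmap: np.ndarray) -> Tuple[np.ndarray, Tuple[int, int]]:
    """Return a one-hot map at the mode of a 2D heat map, plus its (row, col).

    Hypothetical helper: assumes the teacher outputs an H x W map of
    localization scores over the aerial image.
    """
    if heatmap.ndim != 2:
        raise ValueError("expected a 2D heat map")
    # The mode is the cell with the highest score; ties resolve to the
    # first maximum in row-major order (np.argmax behaviour).
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    one_hot = np.zeros(heatmap.shape, dtype=np.float64)
    one_hot[row, col] = 1.0
    return one_hot, (int(row), int(col))
```

Under these assumptions, a multi-modal teacher output is replaced by a single peak, so the student is not trained to reproduce the teacher's uncertainty.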

Paper Structure (16 sections, 9 equations, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Learning-based cross-view localization models often perform well when test images are from the same area used in training, as shown in the green box. At inference in a new target area where no fine ground truth is available, the standard practice (in purple) directly deploys a model trained in a different area, leaving an obvious domain gap. Due to this domain gap, direct generalization often results in a performance drop, causing uncertain or erroneous predictions. To address this, we propose a knowledge self-distillation-based weakly supervised learning approach (in cyan) to adapt the model to the target area using only ground-aerial image pairs without fine ground truth locations. This leads to better localization performance.
  • Figure 1: VIGOR test set errors (vertical axis) of CCVPE models finetuned on noisy ground truth. The horizontal axis denotes the upper bound for error sampling.
  • Figure 2: Overview of our proposed weakly-supervised learning approach. We first employ a teacher model trained on data from another area to generate pseudo GT, $\boldsymbol{\mathcal{P}}_\beta$, on target-area images, shown in blue. The pseudo GT is then used to train an auxiliary student model $\mathcal{M}_o$. After that, we compare the predictions from the teacher model and those from the auxiliary student model, and filter out unreliable teacher predictions (the middle grey box of this figure). The remaining predictions with their pseudo GT, $\boldsymbol{\mathcal{P}}_{\tilde{\beta}}$, are used to train our final student model $\mathcal{M}_\beta$, shown in green.
  • Figure 2: Errors of CCVPE models with different entropy minimization weights $\omega$ on the VIGOR validation set.
  • Figure 3: Predictions of the CCVPE teacher and student models on the VIGOR test set. The red color denotes the localization probability (a darker color means a higher probability).
  • ...and 7 more figures
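
The filtering step described in the overview caption (comparing teacher predictions with those of the auxiliary student and discarding unreliable ones) could be sketched as follows. This is a minimal illustration under stated assumptions: the function name `filter_pseudo_gt`, the use of Euclidean distance between predicted locations, and the fixed `max_dist` threshold are all hypothetical choices, and the paper's actual filtering criterion may differ.

```python
import numpy as np


def filter_pseudo_gt(teacher_xy, aux_student_xy, max_dist):
    """Return a boolean mask of samples whose pseudo GT is kept.

    Hypothetical helper: teacher_xy and aux_student_xy are (N, 2) arrays of
    predicted locations for the same N target-area images. A sample is kept
    when the two predictions lie within max_dist of each other (same units
    as the coordinates); large disagreement is treated as a sign that the
    teacher prediction is unreliable.
    """
    teacher_xy = np.asarray(teacher_xy, dtype=float)
    aux_student_xy = np.asarray(aux_student_xy, dtype=float)
    if teacher_xy.shape != aux_student_xy.shape or teacher_xy.ndim != 2:
        raise ValueError("expected two arrays of identical shape (N, 2)")
    # Per-sample Euclidean distance between the two predicted locations.
    dist = np.linalg.norm(teacher_xy - aux_student_xy, axis=1)
    return dist <= max_dist
```

The resulting mask would select the subset of pseudo GT used to train the final student model; a rank-based cutoff (keeping a fixed fraction of the closest pairs) is an equally plausible variant.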