Statewide Visual Geolocalization in the Wild

Florian Fervers; Sebastian Bullinger; Christoph Bodensteiner; Michael Arens; Rainer Stiefelhagen

Abstract

This work presents a method that predicts the geolocation of a street-view photo taken in the wild within a state-sized search region by matching it against a database of aerial reference imagery. We partition the search region into geographical cells and train a model to map cells and their corresponding photos into a joint embedding space, which is used for retrieval at test time. The model uses aerial images of each cell at multiple levels of detail to provide sufficient information about the surrounding scene. We propose a novel layout of the search region with consistent cell resolutions that allows scaling to large geographical regions. Experiments demonstrate that the method successfully localizes 60.6% of all non-panoramic street-view photos uploaded to the crowd-sourcing platform Mapillary in the state of Massachusetts to within 50m of their ground-truth location. Source code is available at https://github.com/fferflo/statewide-visual-geolocalization.
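The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that cell and photo embeddings are compared by cosine similarity, which the abstract does not state. The function names `l2_normalize` and `retrieve`, the cell identifiers, and the three-dimensional toy embeddings are all hypothetical and are not taken from the authors' code.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def retrieve(photo_embedding, cell_embeddings, top_k=1):
    """Rank search region cells by similarity to a street-view photo embedding.

    Assumption: similarity is cosine similarity in the joint embedding space.
    cell_embeddings maps a (hypothetical) cell id to its embedding vector.
    Returns the top_k (cell_id, score) pairs, best first.
    """
    query = l2_normalize(photo_embedding)
    scored = []
    for cell_id, embedding in cell_embeddings.items():
        cell = l2_normalize(embedding)
        score = sum(a * b for a, b in zip(query, cell))
        scored.append((cell_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy database of three cells with made-up embeddings.
cells = {
    "cell_a": [1.0, 0.0, 0.0],
    "cell_b": [0.0, 1.0, 0.0],
    "cell_c": [0.7, 0.7, 0.0],
}
photo = [0.9, 0.1, 0.0]
best = retrieve(photo, cells, top_k=2)
print(best)
```

In a state-sized search region the exhaustive loop would likely be replaced by a batched matrix product or an approximate nearest-neighbor index; the sketch only shows the ranking logic.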

Paper Structure (23 sections, 2 equations, 9 figures, 3 tables)

Introduction
Related Work
Street-view to Street-view Geolocalization
Cross-view Geolocalization
Overview
Problem formulation
Methods
Method
Search Region Layout
Choice of Aerial Images per Cell
Model
Training Setup
Hard Example Mining
Evaluation
Data
...and 8 more sections

Figures (9)

Figure 1: Successful localization of street-view photos in the state of Massachusetts. The search region's color indicates the predicted score for possible camera locations. The crosshair shows the ground-truth location.
Figure 2: Example of the search region layout. Each box in (a) represents a search region cell and is assigned an embedding vector. The embedding is predicted using multiple resolutions of aerial imagery centered on the cell. This provides higher detail for parts of the scene that are close to the street-view camera, and less detail for parts that are further away and appear smaller in the photo.
Figure 3: Different choices of aerial images per search region cell. The solid square delineates the search region cell. We position an example street-view camera with limited FOV at the center of the cell. A dashed square represents a single aerial image. The highlighted triangular region indicates the parts of the aerial image(s) that overlap with the camera frustum.
Figure 4: Overview of the architecture. Our model uses the ConvNeXt backbone [liu2022convnet] to encode images, and a multi-head attention block to pool the encoded tokens into a single embedding representation [vaswani2017attention, lee2019set]. We use shared weights for the model applied to the aerial images.
Figure 5: Localization of street-view photos in the state of Massachusetts. Overall, 60.6% of images are localized correctly. False positives (FP) are cells that are scored higher by the model than all cells within 50m of the ground-truth location. The search region's color indicates the predicted score for possible camera locations. The white circle delineates the 50m radius around the ground-truth position.
...and 4 more figures
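
The success criterion in the Figure 5 caption can be written out as a short sketch. It assumes that cell centers and the ground-truth location are given as planar coordinates in meters, and that a photo counts as correctly localized when no cell outside the 50m radius outscores every cell inside it. The function name `evaluate_photo` and the toy values are hypothetical.

```python
import math


def evaluate_photo(scores, cell_positions, gt_position, radius_m=50.0):
    """Apply the false-positive definition from the Figure 5 caption.

    scores: predicted score per cell.
    cell_positions: (x, y) cell centers in meters (assumed planar coordinates).
    gt_position: (x, y) ground-truth camera location in meters.
    Returns (is_correct, false_positive_indices).
    """
    near = {
        i for i, position in enumerate(cell_positions)
        if math.dist(position, gt_position) <= radius_m
    }
    if not near:
        raise ValueError("no cell lies within the radius of the ground truth")
    best_near_score = max(scores[i] for i in near)
    # A false positive is a cell outside the radius that scores higher
    # than all cells within the radius.
    false_positives = [
        i for i, score in enumerate(scores)
        if i not in near and score > best_near_score
    ]
    return len(false_positives) == 0, false_positives


# Four toy cells along a line; the ground truth is 10m from the first cell.
positions = [(0.0, 0.0), (30.0, 0.0), (200.0, 0.0), (1000.0, 0.0)]
ground_truth = (10.0, 0.0)

missed = evaluate_photo([0.2, 0.6, 0.8, 0.1], positions, ground_truth)
hit = evaluate_photo([0.2, 0.9, 0.8, 0.1], positions, ground_truth)
print(missed, hit)
```

Averaging `is_correct` over all test photos would give a recall figure comparable in form to the 60.6% reported in the caption, under the stated assumptions.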
