MeshVPR: Citywide Visual Place Recognition Using 3D Meshes

Gabriele Berton; Lorenz Junglas; Riccardo Zaccone; Thomas Pollok; Barbara Caputo; Carlo Masone

TL;DR

MeshVPR addresses citywide visual place recognition (VPR) with 3D textured meshes by closing the domain gap between real query images and synthetic, mesh-derived databases. It introduces a lightweight feature alignment framework that fine-tunes a model for synthetic images so that its embeddings align with those of a model for real images, enabling effective retrieval from synthetic databases with pretrained VPR backbones. The authors provide three new city-scale test sets with freely available meshes and demonstrate that MeshVPR delivers performance competitive with standard VPR pipelines while enabling scalable deployment, data reuse, and privacy advantages; they also analyze mesh quality, how to bridge the Syn2Real gap, and training data requirements. The work points to practical implications for scalable mesh-based localization and outlines future directions such as full mesh-based visual localization pipelines, multi-domain synthetic imagery, and drone-like viewpoints for localization at city scale.
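
To make the alignment idea concrete, here is a minimal sketch; it is not the authors' implementation. It assumes the alignment objective is a mean squared error between paired real and synthetic embeddings, and it replaces both networks with toy stand-ins: the frozen real-image model is treated as given embeddings, and the trainable synthetic-image model is a hypothetical per-dimension affine map. All names and data are illustrative.

```python
import random

# Toy stand-in for feature alignment. Assumptions (not from the paper):
# embeddings are short lists of floats, the loss is a mean squared error,
# and the trainable "synthetic model" is a per-dimension affine map.

def apply_affine(scale, bias, feats):
    """Apply the hypothetical synthetic-domain head to each feature vector."""
    return [[s * x + b for s, b, x in zip(scale, bias, f)] for f in feats]

def mse(pred, target):
    """Mean squared error over all vectors and dimensions."""
    total = sum((p - t) ** 2 for u, v in zip(pred, target) for p, t in zip(u, v))
    return total / (len(pred) * len(pred[0]))

def align(real, synth, lr=0.5, steps=300):
    """Fit scale and bias by gradient descent so synthetic embeddings match real ones."""
    n, dim = len(real), len(real[0])
    scale, bias = [1.0] * dim, [0.0] * dim
    for _ in range(steps):
        g_scale, g_bias = [0.0] * dim, [0.0] * dim
        for r, s in zip(real, synth):
            for d in range(dim):
                # Residual between the aligned synthetic value and the frozen real value.
                err = scale[d] * s[d] + bias[d] - r[d]
                g_scale[d] += 2.0 * err * s[d] / n
                g_bias[d] += 2.0 * err / n
        scale = [w - lr * g for w, g in zip(scale, g_scale)]
        bias = [b - lr * g for b, g in zip(bias, g_bias)]
    return scale, bias

# Made-up paired data: the "domain gap" is simulated as a shrink plus a shift.
rng = random.Random(0)
real = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(32)]
synth = [[0.5 * x + 0.3 for x in r] for r in real]

loss_before = mse(synth, real)
scale, bias = align(real, synth)
loss_after = mse(apply_affine(scale, bias, synth), real)
print(loss_before, loss_after)
```

In this toy setting only the synthetic side is updated while the real embeddings stay fixed, which mirrors the idea of keeping a pretrained real-image model untouched and adapting the synthetic one to it.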

Abstract

Mesh-based scene representation offers a promising direction for simplifying large-scale hierarchical visual localization pipelines, combining a visual place recognition step based on global features (retrieval) and a visual localization step based on local features. While existing work demonstrates the viability of meshes for visual localization, the impact of using synthetic databases rendered from them in visual place recognition remains largely unexplored. In this work we investigate using dense 3D textured meshes for large-scale Visual Place Recognition (VPR). We identify a significant performance drop when using synthetic mesh-based image databases compared to real-world images for retrieval. To address this, we propose MeshVPR, a novel VPR pipeline that utilizes a lightweight features alignment framework to bridge the gap between real-world and synthetic domains. MeshVPR leverages pre-trained VPR models and is efficient and scalable for city-wide deployments. We introduce novel datasets with freely available 3D meshes and manually collected queries from Berlin, Paris, and Melbourne. Extensive evaluations demonstrate that MeshVPR achieves competitive performance with standard VPR pipelines, paving the way for mesh-based localization systems. Data, code, and interactive visualizations are available at https://meshvpr.github.io/
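
The retrieval step mentioned above can be illustrated with a small sketch, under the assumption that each image is summarized by one global descriptor and that candidates are ranked by cosine similarity. The descriptors below are made-up three-dimensional vectors; in practice they would come from a VPR model, and the function names here are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a descriptor to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query, database, k=2):
    """Return indices of the k database descriptors most similar to the query."""
    q = l2_normalize(query)
    scored = []
    for idx, desc in enumerate(database):
        d = l2_normalize(desc)
        scored.append((sum(a * b for a, b in zip(q, d)), idx))
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Hypothetical descriptors: the database would be rendered from the mesh,
# the query would be a real photo.
database = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query = [0.9, 0.1, 0.0]

predictions = top_k(query, database, k=2)
print(predictions)  # [0, 2]
```

An exhaustive scan like this is only practical for small databases; a citywide database would normally rely on an approximate nearest-neighbor index instead.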

Paper Structure (19 sections, 1 equation, 6 figures, 4 tables)

Introduction
Related work
Mesh-based Visual Place Recognition Pipeline
Step 1: Download images and mesh for the alignment
Step 2: Generate alignment images from mesh
Step 3: Features alignment
Step 4: Generate the test database
Step 5: Inference
Test sets
Experiments
Implementation details
Localizing Real Queries on a Synthetic Database
How 3D mesh quality affects results
Bridging the Syn2Real performance gap
Comparing MeshVPR with other strategies
...and 4 more sections

Figures (6)

Figure 1: Pairs of real images and their synthetic counterpart. Pairs like these are used for MeshVPR's features alignment.
Figure 2: Our proposed pipeline for mesh-based visual place recognition. The training phase consists of downloading training (real) images and the 3D mesh, generating their synthetic counterparts, and specializing the synthetic model through feature alignment. Once the training phase is completed, the deployment phase can take place in any target city: in this paper we show results on Berlin, Paris and Melbourne.
Figure 3: Predictions with the best MeshVPR model, namely SALAD + MeshVPR. Each triplet represents a query and its top 2 predictions, which are outlined in green if positive and red if negative. Qualitative examples help explain the results in the main results table: Paris is challenging due to low-quality meshes, and Melbourne is challenging due to wide open spaces. Interestingly, we note that the model learns to overcome long-term temporal changes (snow and winter/summer foliage in the top-left query), occlusions (first two queries from Paris) and perspective changes (third query from Melbourne). A large number of (higher resolution) qualitative results are shown in the Supplementary.
Figure 4: Triplets of real, synthetic from HQ mesh, and synthetic from LQ mesh. These triplets make it possible to qualitatively understand how the quality of the mesh influences the generated images and results. The bottom-right triplet provides an example of synthetic images with artifacts. They occur when the real image was taken in a covered area, e.g. a tunnel or under tree cover, and the viewpoint is within the mesh. Examples with such artifacts account for less than 1% of the dataset.
Figure 5: Results with MeshVPR on High Quality (HQ) and Low Quality (LQ) meshes. Quantitative results (left) indicate a strong correlation between results and mesh quality. All results in the table are computed with MeshVPR applied to different VPR models. Qualitative results (right) visually show how predictions are affected by the synthetically generated images. For each of the 4 queries (i.e. the images without green/red boxes) we show the top-2 candidates with SALAD+MeshVPR on the high-quality (HQ) database (top row for each query) and the top-2 candidates on the low-quality (LQ) database (bottom row for each query).
...and 1 more figure
