Breaking the Frame: Visual Place Recognition by Overlap Prediction

Tong Wei, Philipp Lindenberger, Jiri Matas, Daniel Barath

TL;DR

This work tackles visual place recognition under occlusion by reframing retrieval as visual overlap prediction rather than global similarity. It introduces VOP, which uses patch-level embeddings from a Vision Transformer and a robust voting scheme to measure co-visible overlap between query and database images, enabling effective top-k retrieval for pose estimation. Trained with contrastive, patch-level supervision and 3D-reconstruction-derived ground-truth, VOP generalizes well across MegaDepth, ETH3D, PhotoTourism, and InLoc, often outperforming state-of-the-art global and reranking baselines and accelerating pose-graph construction. The approach offers practical benefits for real-world localization and 3D reconstruction in challenging, partially occluded environments, and the authors provide open-source code for adoption and further research.
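To make the training signal concrete, below is a minimal sketch of patch-level contrastive supervision under simplifying assumptions: an InfoNCE-style loss over co-visible patch pairs. The encoder head, the exact loss, and the 3D-reconstruction-based ground-truth generation used in the paper are not reproduced; random embeddings and hand-written matches stand in for illustration.

```python
# Sketch of patch-level contrastive supervision (simplified; not the paper's exact loss).
# "matches" is assumed to come from a 3D reconstruction that labels which query and
# database patches observe the same surface.
import torch
import torch.nn.functional as F

def patch_contrastive_loss(query_emb, db_emb, matches, temperature=0.07):
    """query_emb: (N, D), db_emb: (M, D) patch embeddings from the encoder head.
    matches: (K, 2) long tensor of (query_idx, db_idx) co-visible patch pairs."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(db_emb, dim=-1)
    logits = q @ d.t() / temperature            # (N, M) scaled cosine similarities
    # For each matched query patch, its ground-truth database patch is the positive;
    # all other database patches act as negatives (InfoNCE-style).
    return F.cross_entropy(logits[matches[:, 0]], matches[:, 1])

# Toy usage with random embeddings standing in for the encoder output.
q = torch.randn(64, 256)
d = torch.randn(64, 256)
m = torch.tensor([[0, 0], [1, 3], [5, 5]])
print(patch_contrastive_loss(q, d, m).item())
```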

Abstract

Visual place recognition methods struggle with occlusions and partial visual overlaps. We propose a novel visual place recognition approach based on overlap prediction, called VOP, shifting from the traditional reliance on global image similarities and local features to image overlap prediction. VOP identifies co-visible image sections by obtaining patch-level embeddings from a Vision Transformer backbone and establishing patch-to-patch correspondences, without requiring expensive feature detection and matching. Our approach uses a voting mechanism to assess overlap scores for potential database images, providing a nuanced image retrieval metric in challenging scenarios. Experimental results show that VOP leads to more accurate relative pose estimation and localization on the retrieved image pairs than state-of-the-art baselines on a number of large-scale, real-world indoor and outdoor benchmarks. The code is available at https://github.com/weitong8591/vop.git.
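The core overlap idea can be illustrated with a minimal sketch, assuming L2-normalized patch embeddings and a simple counting rule in place of VOP's robust voting; the ViT backbone and trained encoder head are replaced by random tensors here for illustration.

```python
# Sketch of the overlap score between one query/database image pair: match query
# patches to database patches within a cosine-distance radius and let the matched
# patches vote. This is a simplified surrogate for VOP's robust voting, not the
# paper's exact rule.
import torch
import torch.nn.functional as F

def overlap_score(query_patches, db_patches, radius=0.3):
    """query_patches: (N, D), db_patches: (M, D) patch embeddings of one image pair.
    Returns the fraction of query patches with at least one database patch
    within the given cosine-distance radius."""
    q = F.normalize(query_patches, dim=-1)
    d = F.normalize(db_patches, dim=-1)
    cos_dist = 1.0 - q @ d.t()                  # (N, M) cosine distances
    votes = (cos_dist <= radius).any(dim=1)     # a query patch votes if it finds a neighbor
    return votes.float().mean().item()

# Toy usage: two images represented by 64 random patch embeddings each.
score = overlap_score(torch.randn(64, 256), torch.randn(64, 256))
print(f"estimated overlap: {score:.2f}")
```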

Paper Structure

This paper contains 14 sections, 9 equations, 10 figures, and 9 tables.

Figures (10)

  • Figure 1: An example where the SOTA AnyLoc [keetha2023anyloc] scores a negative DB image (right, at a different location) higher than an occluded positive example (left, the same scene as the query under heavy occlusion). VOP ranks the database (DB) images correctly.
  • Figure 2: A patch matching example with $8^2$ patches. The 256 patches from DINOv2 [oquab2023dinov2] are average-pooled to 64. The numbers inside the patches indicate which ones are matched. The color overlay is computed by PCA on the patch embeddings.
  • Figure 3: The training pipeline of the proposed Visual Overlap Prediction (VOP) model, consisting of a frozen DINOv2 backbone [oquab2023dinov2] that breaks input images into rectangular patches, a trainable encoder head, and a contrastive loss with patch-to-patch overlap supervision.
  • Figure 4: The proposed VOP at inference (see the sketch after this list). Given an input query image and a reference image collection, the frozen backbone [oquab2023dinov2] extracts patch-level features, which are fed into our trained encoder to obtain the final embeddings. For each patch in the query image, a radius neighbor search is performed in the embedding space. The final overlap scores are determined by robust voting on the resulting "query"-to-"database" patch neighbors. In practice, the map embeddings are pre-computed offline and saved.
  • Figure 5: The number of connected components (vertical axis, cc) is plotted against the index of the image pair (horizontal) on which RANSAC-based relative pose estimation runs. The left plot shows results for the top-1 database (DB) images paired with $0.4K$ query images, where the number of pairs equals the number of queries. The right plot shows results for the top-10 DB images with a termination criterion applied once all images fall into a single component. Row "max $cc_{size}$" reports the number of elements in the largest cc. Row "# cc" is the final number of connected components, while "$\text{idx}_{\text{last}}$" shows the index at which the termination criterion was triggered in the right plot. Rows "# skipped", "# success", and "# failure" show the numbers of skipped, succeeded, and failed RANSAC runs. All are given in percentages.
  • ...and 5 more figures
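Complementing Figure 4, the sketch below illustrates the inference flow under simplifying assumptions: database patch embeddings are pre-computed offline, each query patch performs a radius neighbor search over all database patches, and the neighbors vote for their source image to rank a top-k shortlist. The trained encoder head and VOP's robust voting are replaced by plain vote counting here.

```python
# Sketch of Figure 4's retrieval step: rank database images by accumulated patch votes.
# Embeddings are random stand-ins; in practice they would come from the frozen backbone
# plus the trained encoder head, with the database embeddings cached offline.
import torch
import torch.nn.functional as F

def rank_database(query_patches, db_patches_per_image, radius=0.3, topk=5):
    """query_patches: (N, D); db_patches_per_image: list of (M_i, D) tensors.
    Returns (scores, indices) of the top-k database images by patch votes."""
    q = F.normalize(query_patches, dim=-1)                                   # (N, D)
    db = torch.cat([F.normalize(p, dim=-1) for p in db_patches_per_image])   # (sum M_i, D)
    image_id = torch.cat([torch.full((p.shape[0],), i, dtype=torch.long)
                          for i, p in enumerate(db_patches_per_image)])
    cos_dist = 1.0 - q @ db.t()                                              # (N, sum M_i)
    within = cos_dist <= radius                                              # radius neighbor search
    votes = torch.zeros(len(db_patches_per_image))
    votes.index_add_(0, image_id, within.sum(dim=0).float())                 # votes per DB image
    k = min(topk, len(db_patches_per_image))
    return torch.topk(votes, k)

# Toy usage: one query against ten database images, 64 patches of dim 256 each.
query = torch.randn(64, 256)
database = [torch.randn(64, 256) for _ in range(10)]
scores, idx = rank_database(query, database)
print(idx.tolist())
```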