Breaking the Frame: Visual Place Recognition by Overlap Prediction
Tong Wei, Philipp Lindenberger, Jiri Matas, Daniel Barath
TL;DR
This work tackles visual place recognition under occlusion by reframing retrieval as visual overlap prediction rather than global similarity. It introduces VOP, which uses patch-level embeddings from a Vision Transformer and a robust voting scheme to measure co-visible overlap between query and database images, enabling effective top-k retrieval for pose estimation. Trained with contrastive, patch-level supervision and 3D-reconstruction-derived ground-truth, VOP generalizes well across MegaDepth, ETH3D, PhotoTourism, and InLoc, often outperforming state-of-the-art global and reranking baselines and accelerating pose-graph construction. The approach offers practical benefits for real-world localization and 3D reconstruction in challenging, partially occluded environments, and the authors provide open-source code for adoption and further research.
Abstract
Visual place recognition methods struggle with occlusions and partial visual overlaps. We propose a novel visual place recognition approach based on overlap prediction, called VOP, shifting from the traditional reliance on global image similarities and local features to image overlap prediction. VOP identifies co-visible image sections by obtaining patch-level embeddings from a Vision Transformer backbone and establishing patch-to-patch correspondences, without requiring expensive feature detection and matching. Our approach uses a voting mechanism to assess overlap scores for potential database images, providing a nuanced image retrieval metric in challenging scenarios. Experimental results show that VOP leads to more accurate relative pose estimation and localization on the retrieved image pairs than state-of-the-art baselines on a number of large-scale, real-world indoor and outdoor benchmarks. The code is available at https://github.com/weitong8591/vop.git.
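The abstract describes ranking database images by an overlap score computed from patch-to-patch correspondences and a voting mechanism. The sketch below illustrates this idea in a minimal form; the function names, the mutual-nearest-neighbor voting rule, and the similarity threshold are illustrative assumptions, not the authors' actual VOP implementation.

```python
import numpy as np

def overlap_score(query_emb, db_emb, sim_threshold=0.8):
    """Toy patch-voting overlap score (hypothetical sketch, not the
    official VOP method): each query patch casts a vote for a database
    image if its best-matching database patch is a mutual nearest
    neighbor with cosine similarity above a threshold; the overlap
    score is the fraction of query patches that vote."""
    # L2-normalize patch embeddings so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = db_emb / np.linalg.norm(db_emb, axis=1, keepdims=True)
    sim = q @ d.T                 # (num_query_patches, num_db_patches)
    best_db = sim.argmax(axis=1)  # best database patch for each query patch
    best_q = sim.argmax(axis=0)   # best query patch for each database patch
    # A query patch votes only if its match is mutual and confident.
    mutual = best_q[best_db] == np.arange(len(q))
    confident = sim[np.arange(len(q)), best_db] >= sim_threshold
    return float((mutual & confident).mean())

def retrieve_top_k(query_emb, database_embs, k=5):
    """Rank database images by overlap score; return top-k indices."""
    scores = np.array([overlap_score(query_emb, d) for d in database_embs])
    return np.argsort(scores)[::-1][:k]
```

In this sketch, fully co-visible images score near 1 while unrelated images score near 0, so top-k retrieval naturally favors pairs with large co-visible regions, which is the property the paper targets for downstream pose estimation.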
