Reprojection Errors as Prompts for Efficient Scene Coordinate Regression
Ting-Ru Liu, Hsuan-Kung Yang, Jou-Min Liu, Chun-Wei Huang, Tsung-Chih Chiang, Quan Kong, Norimasa Kobori, Chun-Yi Lee
TL;DR
The paper addresses visual localization based on scene coordinate regression (SCR), highlighting how dynamic objects and textureless regions hinder training stability and accuracy. It introduces Error-Guided Feature Selection (EGFS), coupled with the Segment Anything Model (SAM) and a confidence refinement mechanism: pixels with low reprojection error are seeded as prompts, SAM expands them into masks, and robust training regions are sampled iteratively without relying on fixed semantic labels, while losses are weighted by a per-pixel confidence $c_i$. Empirically, EGFS achieves state-of-the-art or competitive results on Cambridge Landmarks and Indoor6 with smaller model sizes and reduced training time, and ablations confirm the positive contribution of both EGFS and confidence refinement. The method enables efficient, robust SCR-based localization in diverse environments by leveraging semantic context through SAM and a data-driven focus on reliable regions. Overall, the work advances SCR by integrating error-driven, semantics-aware sampling with confidence-aware optimization, offering a scalable approach to accurate 6-DoF pose estimation in real-world scenarios.
Abstract
Scene coordinate regression (SCR) methods have emerged as a promising area of research due to their potential for accurate visual localization. However, many existing SCR approaches train on samples from all image regions, including dynamic objects and texture-less areas. Using these areas for optimization during training can hamper the overall performance and efficiency of the model. In this study, we first perform an in-depth analysis to validate the adverse impacts of these areas. Drawing inspiration from our analysis, we then introduce an error-guided feature selection (EGFS) mechanism, used in tandem with the Segment Anything Model (SAM). This mechanism seeds areas with low reprojection error as prompts, expands them into error-guided masks, and then uses these masks to sample points and filter out problematic areas in an iterative manner. The experiments demonstrate that our method outperforms existing SCR approaches that do not rely on 3D information on the Cambridge Landmarks and Indoor6 datasets.
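The seeding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `reprojection_errors` and `select_seed_prompts`, the pixel threshold, and the prompt budget are all hypothetical, and the pose is assumed to be given as a world-to-camera rotation `R` and translation `t`. The sketch computes a per-pixel reprojection error for predicted scene coordinates and keeps the lowest-error pixels as point prompts; passing those points to SAM to obtain masks is left out.

```python
import numpy as np

def reprojection_errors(scene_coords, pixels, K, R, t):
    """Per-pixel reprojection error in pixels.

    scene_coords: (N, 3) predicted 3D points in the world frame
    pixels:       (N, 2) image locations the predictions belong to
    K:            (3, 3) camera intrinsics
    R, t:         world-to-camera rotation (3, 3) and translation (3,)
    """
    cam = scene_coords @ R.T + t            # world frame -> camera frame
    depth = cam[:, 2]
    proj = cam @ K.T                        # homogeneous image coordinates
    uv = proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)
    err = np.linalg.norm(uv - pixels, axis=1)
    err[depth <= 0] = np.inf                # points behind the camera are invalid
    return err

def select_seed_prompts(pixels, errors, threshold=2.0, max_prompts=8):
    """Keep up to max_prompts pixels whose error is below threshold.

    threshold and max_prompts are illustrative values, not from the paper.
    """
    order = np.argsort(errors)              # lowest error first
    keep = order[errors[order] < threshold][:max_prompts]
    return pixels[keep]

# Toy example: identity pose, three predicted scene coordinates.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
scene_coords = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [0.0, 0.0, -1.0]])
pixels = np.array([[50.0, 50.0], [60.0, 53.0], [50.0, 50.0]])

errors = reprojection_errors(scene_coords, pixels, K, R, t)
prompts = select_seed_prompts(pixels, errors)
```

In this toy setup only the first pixel falls under the threshold, so it alone would be used as a prompt. In an iterative scheme like the one the abstract describes, the errors would be recomputed as training progresses and the prompts re-selected.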
