
Reprojection Errors as Prompts for Efficient Scene Coordinate Regression

Ting-Ru Liu, Hsuan-Kung Yang, Jou-Min Liu, Chun-Wei Huang, Tsung-Chih Chiang, Quan Kong, Norimasa Kobori, Chun-Yi Lee

TL;DR

The paper addresses visual localization based on scene coordinate regression (SCR), highlighting how dynamic objects and textureless regions hinder training stability and accuracy. It introduces Error-Guided Feature Selection (EGFS), which combines the Segment Anything Model (SAM) with a confidence refinement mechanism: pixels with low reprojection error seed point prompts, SAM expands the prompts into masks, and training regions are sampled from these masks iteratively, without relying on fixed semantic labels. Losses are weighted by a per-pixel confidence $c_i$. Empirically, EGFS achieves state-of-the-art or competitive results on Cambridge Landmarks and Indoor6 with smaller model sizes and reduced training time, and ablations confirm that both EGFS and confidence refinement contribute positively. The method enables efficient, robust SCR-based localization in diverse environments by using SAM for semantic context while keeping a data-driven focus on reliable regions. Overall, the work advances SCR by integrating error-driven, semantics-aware sampling with confidence-aware optimization, offering a scalable approach to accurate 6-DoF pose estimation in real-world scenarios.
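As a rough illustration of weighting losses by a per-pixel confidence $c_i$, the sketch below computes a normalized confidence-weighted mean of per-pixel errors. The exact loss used in the paper is not given in this summary, so the normalization, the `eps` term, and the function name `confidence_weighted_loss` are assumptions made for illustration only.

```python
import numpy as np

def confidence_weighted_loss(errors, confidence, eps=1e-8):
    """Confidence-weighted mean of per-pixel errors (illustrative form only).

    errors:     per-pixel error values e_i (any shape)
    confidence: per-pixel confidence c_i, same shape, assumed non-negative
    eps:        hypothetical stabilizer to avoid division by zero
    """
    errors = np.asarray(errors, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    # Pixels with low confidence contribute little to the total.
    return float(np.sum(confidence * errors) / (np.sum(confidence) + eps))
```

With uniform confidence this reduces to the plain mean error; setting a pixel's confidence to zero removes it from the loss entirely.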

Abstract

Scene coordinate regression (SCR) methods have emerged as a promising area of research due to their potential for accurate visual localization. However, many existing SCR approaches train on samples from all image regions, including dynamic objects and texture-less areas. Using these areas for optimization during training can hamper the overall performance and efficiency of the model. In this study, we first perform an in-depth analysis to validate the adverse impacts of these areas. Drawing inspiration from our analysis, we then introduce an error-guided feature selection (EGFS) mechanism, used in tandem with the Segment Anything Model (SAM). This mechanism seeds areas with low reprojection error as prompts, expands them into error-guided masks, and then uses these masks to sample points and filter out problematic areas in an iterative manner. The experiments demonstrate that our method outperforms existing SCR approaches that do not rely on 3D information on the Cambridge Landmarks and Indoor6 datasets.
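The seeding step described in the abstract can be sketched as follows: compute a reprojection error for each predicted scene coordinate, then keep the lowest-error pixels as point prompts. This is a minimal NumPy sketch under assumptions, not the authors' implementation: the pinhole camera model, the world-to-camera pose convention, and the names `reprojection_errors`, `select_point_prompts`, and `num_prompts` are hypothetical. The expansion of prompts into masks by SAM is not shown.

```python
import numpy as np

def reprojection_errors(scene_coords, pixels, K, R, t):
    """Per-pixel reprojection error, in pixels.

    scene_coords: (N, 3) predicted 3D points in the world frame
    pixels:       (N, 2) image locations (u, v) the points were predicted for
    K:            (3, 3) pinhole intrinsics (assumed model)
    R, t:         world-to-camera rotation (3, 3) and translation (3,)
    """
    cam = scene_coords @ R.T + t              # world frame -> camera frame
    uvw = cam @ K.T                           # homogeneous image coordinates
    depth = np.clip(uvw[:, 2:3], 1e-6, None)  # guard against zero or negative depth
    projected = uvw[:, :2] / depth
    return np.linalg.norm(projected - pixels, axis=1)

def select_point_prompts(errors, pixels, num_prompts=5):
    """Return the num_prompts pixels with the lowest reprojection error."""
    order = np.argsort(errors)[:num_prompts]
    return pixels[order], errors[order]
```

The selected pixel locations would then be passed as point prompts to a promptable segmenter such as SAM, which returns the error-guided mask.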

Paper Structure (33 sections, 2 equations, 8 figures, 5 tables)


Figures (8)

  • Figure 1: Visualization of the primary components (i.e., (d)-(h)) introduced in the proposed visual localization scheme. (d) illustrates the point prompts selected from (b) with low reprojection errors, while (e) presents an error-guided mask expanded from the prompted points in (d) using SAM. (f) displays the proposed error-guided feature selection (EGFS), which refines the mask from (e) with the predicted confidence map (c) to ensure high-quality scene coordinates are sampled for estimating the final camera pose. The point cloud constructed from the predicted scene coordinates is shown on the right-hand side (i.e., (g)-(h)), with the confidence (yellow parts) and the refined EGFS mask (green for selected areas; red for rejected areas).
  • Figure 2: Analysis of the relationship between reprojection error and semantic meaning. The results indicate that regions with low reprojection errors tend to have higher inlier ratios, while the errors do not always align with specific semantic categories, e.g., "tree" and "rug".
  • Figure 3: An overview of the training framework.
  • Figure 4: An overview of the inference procedure.
  • Figure 5: Visualization of the EGFS mask refinement process at five-epoch intervals, showing the reprojection errors at the beginning (epoch 5) and the end (epoch 20), as well as the refined error-guided masks used throughout training. The red dots mark locations with low reprojection error that serve as prompts, while the light green overlay denotes the refined EGFS masks. The EGFS masks improve over the epochs.
  • ...and 3 more figures
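
The refinement described in the Figure 1 caption, where the error-guided mask (e) is refined with the predicted confidence map (c) before scene coordinates are sampled, might look roughly like the sketch below. The thresholding rule, the default `conf_threshold` value, and the function names are assumptions for illustration; the paper's actual refinement rule is not given in this summary.

```python
import numpy as np

def refine_mask(error_guided_mask, confidence, conf_threshold=0.5):
    """Keep only mask pixels whose predicted confidence exceeds a threshold.

    error_guided_mask: (H, W) boolean or 0/1 mask, e.g. expanded from point prompts
    confidence:        (H, W) predicted per-pixel confidence map
    conf_threshold:    hypothetical cutoff, not taken from the paper
    """
    return error_guided_mask.astype(bool) & (confidence > conf_threshold)

def sample_from_mask(mask, num_samples, rng=None):
    """Sample up to num_samples pixel locations (x, y) from inside the mask."""
    rng = np.random.default_rng() if rng is None else rng
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return np.empty((0, 2), dtype=int)
    idx = rng.choice(ys.size, size=min(num_samples, ys.size), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)
```

Sampling without replacement from the refined mask means rejected regions never contribute points, which matches the selected and rejected areas shown in the figure.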