PoI: A Filter to Extract Pixel of Interest from Novel Views for Scene Coordinate Regression

Feifei Li, Qi Song, Chi Zhang, Hui Shuai, Rui Huang

TL;DR

PoI (Pixel-of-Interest) is a framework that enables effective NVS augmentation for SCR-based localization. It applies a progressive pixel-level filtering strategy based on reprojection error to retain trustworthy synthetic pixels during training while suppressing harmful ones.

Abstract

Neural View Synthesis (NVS) techniques such as NeRF and 3D Gaussian Splatting (3DGS) have enabled photorealistic rendering from novel viewpoints and are increasingly used to augment training data for visual localization. However, these methods fundamentally rely on observed geometry and radiance; they interpolate existing information but cannot hallucinate unseen 3D structures or recover missing content under sparse or extreme viewpoints. As a result, rendered views often exhibit blur, structural distortion, or incomplete geometry. While such imperfections may be tolerated by Camera Pose Regression (CPR) methods, they severely degrade Scene Coordinate Regression (SCR), which requires accurate per-pixel 3D supervision. To address this limitation, we introduce PoI (Pixel-of-Interest), a framework that enables effective NVS augmentation for SCR-based localization. We first employ 3DGS to render novel views and leverage a single-step diffusion model to refine them, allowing the synthesis of structurally plausible details beyond purely geometry-driven interpolation. However, even diffusion-refined views may contain unreliable pixels. Therefore, we propose a progressive pixel-level filtering strategy based on reprojection error to selectively retain trustworthy synthetic pixels during training while suppressing harmful ones. Extensive experiments on 7Scenes and Cambridge Landmarks demonstrate that our method consistently improves localization accuracy over strong SCR baselines and achieves state-of-the-art performance with competitive training efficiency. Our results reveal that, for SCR, the benefit of novel view augmentation depends not only on generative realism but also on explicit control of pixel-level reliability.
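
To make the filtering idea concrete, the following is a minimal sketch of a reprojection-error mask over the pixels of a synthetic view. It is not the authors' implementation. The pinhole camera model, the function names (`reprojection_error`, `threshold_at`, `keep_mask`), the linear threshold schedule, and the 50 px and 5 px bounds are all assumptions made for illustration; the abstract only states that the filter is progressive and based on reprojection error.

```python
import numpy as np

def reprojection_error(coords, pose_w2c, K, pixels):
    """Per-pixel reprojection error, in pixels.

    coords:   (N, 3) predicted scene coordinates in the world frame
    pose_w2c: (4, 4) world-to-camera transform of the rendered view
    K:        (3, 3) pinhole intrinsics (assumed camera model)
    pixels:   (N, 2) pixel positions the coordinates were predicted for
    """
    # Move the predicted 3D points into the camera frame.
    cam = coords @ pose_w2c[:3, :3].T + pose_w2c[:3, 3]
    depth = cam[:, 2]
    valid = depth > 1e-6  # points behind the camera cannot be reprojected
    # Perspective projection; invalid depths are divided by 1 to avoid warnings.
    proj = cam @ K.T
    uv = proj[:, :2] / np.where(valid, depth, 1.0)[:, None]
    err = np.linalg.norm(uv - pixels, axis=1)
    return np.where(valid, err, np.inf)

def threshold_at(step, total_steps, start=50.0, end=5.0):
    """Hypothetical schedule: tighten the threshold linearly over training."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return start + t * (end - start)

def keep_mask(coords, pose_w2c, K, pixels, step, total_steps):
    """True for synthetic pixels whose error is within the current threshold."""
    err = reprojection_error(coords, pose_w2c, K, pixels)
    return err <= threshold_at(step, total_steps)
```

Under these assumptions, a synthetic pixel contributes to the training loss only where `keep_mask` is true, so pixels with inconsistent geometry are dropped as the threshold tightens.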

Paper Structure

This paper contains 13 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Left: A comparison of query and novel views from the 7Scenes and Cambridge Landmarks datasets shows notable quality discrepancies. Query frames are typically sharp and structurally consistent, whereas novel frames frequently suffer from blur, missing content, and geometric inconsistencies. Right: Translation error versus training time. 'CoordiNet+' denotes using rendered images as query images for the CPR method CoordiNet (LENS in this case). 'DSAC*+' and 'ACE+' denote the SCR methods DSAC* and ACE trained on a combination of NVS-rendered images and query images. 'ACE+PoI' denotes our proposed PoI method (ACE-based). Directly adding novel views to the training set increases training time to some extent and also degrades the performance of the SCR methods, whereas our PoI approach improves performance with an acceptable increase in training time.
  • Figure 2: Pipeline of our proposed method. (a) Data augmentation: We first sample a group of synthesized camera poses $P_{novel}$ based on the query training poses $P_{query}$ using Fisher Sample. We then render the synthesized views $I_{novel}$ from the sampled poses $P_{novel}$ using the novel view synthesis model. (b) Architecture of the PoI module: First, a pre-trained scene-irrelevant backbone extracts features from the input query images $I_{query}$ and the synthesized novel images $I_{novel}$. The filter is then applied to the features of the rendered images to extract the features of interest. Next, we combine the query features with the retained novel-view features and shuffle the pixel-aligned features to form the aggregated training set. Finally, a scene-specific head estimates the scene coordinates of the pixels. The filtering algorithm is based on the reprojection error and the gradient of the estimated scene coordinates.
  • Figure 3: The data augmentation process under sparse-input conditions.
  • Figure 4: Example results of the PoI method on the 7Scenes and Cambridge Landmarks datasets. To highlight the identified pixels of interest, we scale up the 'Value' (V) channel of the images' HSV representation.
  • Figure 5: Visualized results of scene coordinates and localization trajectories.
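
The highlighting described in the Figure 4 caption can be sketched as follows. This is an illustrative reconstruction rather than the authors' visualization code: the function name `highlight_value`, the `gain` parameter, and its default of 1.5 are assumptions. The sketch relies on the fact that hue and saturation depend only on the ratios between RGB channels, so multiplying all three channels of a pixel by the same factor changes only V.

```python
import numpy as np

def highlight_value(rgb, mask, gain=1.5):
    """Scale the HSV Value channel of the masked pixels.

    rgb:  (H, W, 3) float image with values in [0, 1]
    mask: (H, W) boolean array, True at pixels of interest
    gain: hypothetical brightening factor applied to V
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    v = rgb.max(axis=-1)                    # HSV Value is the largest RGB channel
    new_v = np.clip(v * gain, 0.0, 1.0)     # keep the result in the valid range
    # Scaling all channels by new_v / v changes V and leaves H and S untouched.
    safe_v = np.where(v > 0, v, 1.0)
    scale = np.where(v > 0, new_v / safe_v, 1.0)
    scale = np.where(mask, scale, 1.0)      # pixels outside the mask are unchanged
    return rgb * scale[..., None]
```

A mask produced by the pixel filter can be passed directly as `mask`, which brightens the retained pixels and leaves the rest of the rendered view as-is.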