Neural Refinement for Absolute Pose Regression with Feature Synthesis

Shuai Chen, Yash Bhalgat, Xinghui Li, Jiawang Bian, Kejie Li, Zirui Wang, Victor Adrian Prisacariu

TL;DR

Absolute Pose Regression (APR) methods predict camera poses using only 2D operations at inference time, without strong geometric priors. This work introduces a test-time refinement framework that uses an implicit 3D feature field, the Neural Feature Synthesizer (NeFeS), to render dense novel-view features and refines the pose by minimizing a feature-metric loss $\mathcal{L}_{feature}$. A progressive training strategy and a feature fusion module make the rendered features more robust, and the loss can be backpropagated end to end, improving APR without extra unlabeled data. On Cambridge Landmarks and 7-Scenes, the method achieves state-of-the-art single-image APR accuracy across multiple backbones, offering a practical middle ground between APR and full geometry-based localization with favorable efficiency.

Abstract

Absolute Pose Regression (APR) methods use deep neural networks to directly regress camera poses from RGB images. However, the predominant APR architectures only rely on 2D operations during inference, resulting in limited accuracy of pose estimation due to the lack of 3D geometry constraints or priors. In this work, we propose a test-time refinement pipeline that leverages implicit geometric constraints using a robust feature field to enhance the ability of APR methods to use 3D information during inference. We also introduce a novel Neural Feature Synthesizer (NeFeS) model, which encodes 3D geometric features during training and directly renders dense novel view features at test time to refine APR methods. To enhance the robustness of our model, we introduce a feature fusion module and a progressive training strategy. Our proposed method achieves state-of-the-art single-image APR accuracy on indoor and outdoor datasets.
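
The test-time refinement idea can be illustrated with a deliberately small toy. This is a sketch under stated assumptions, not the authors' implementation. The "pose" is reduced to a 2D translation, and the feature field is a hypothetical set of Gaussian bumps (`CENTERS`, `SIGMA`) sampled on a hypothetical 8x8 pixel grid (`GRID`). Central finite differences stand in for backpropagation through a learned renderer, and every function name is invented for illustration. The only point is the loop itself: render features at the current pose estimate, compare them with the features of the query image, and move the pose to reduce the feature-metric error.

```python
import math

# Hypothetical stand-in for a learned feature field: four Gaussian bumps.
CENTERS = [(0.2, 0.3), (0.8, 0.2), (0.5, 0.9), (0.1, 0.8)]
SIGMA = 0.5
GRID = [(i / 7.0, j / 7.0) for i in range(8) for j in range(8)]  # 8x8 "pixels"


def render_features(pose):
    """Toy renderer: sample the field at every pixel shifted by a 2D pose."""
    tx, ty = pose
    feats = []
    for gx, gy in GRID:
        for cx, cy in CENTERS:
            d2 = (gx + tx - cx) ** 2 + (gy + ty - cy) ** 2
            feats.append(math.exp(-d2 / (2.0 * SIGMA ** 2)))
    return feats


def feature_loss(pose, target):
    """Mean squared error between rendered and target features."""
    rendered = render_features(pose)
    return sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(target)


def numerical_grad(pose, target, eps=1e-5):
    """Central finite differences; a stand-in for backpropagation."""
    grad = []
    for k in range(2):
        hi = list(pose)
        lo = list(pose)
        hi[k] += eps
        lo[k] -= eps
        diff = feature_loss(hi, target) - feature_loss(lo, target)
        grad.append(diff / (2.0 * eps))
    return grad


def refine_pose(coarse_pose, target, iters=200):
    """Gradient descent with step halving, so the loss never increases."""
    pose = list(coarse_pose)
    loss = feature_loss(pose, target)
    for _ in range(iters):
        g = numerical_grad(pose, target)
        step = 1.0
        improved = False
        while step > 1e-8:
            cand = [pose[0] - step * g[0], pose[1] - step * g[1]]
            cand_loss = feature_loss(cand, target)
            if cand_loss < loss:
                pose, loss = cand, cand_loss
                improved = True
                break
            step *= 0.5
        if not improved:  # no step reduces the loss any further
            break
    return pose, loss


true_pose = (0.10, -0.05)              # unknown in practice
coarse_pose = (0.25, 0.10)             # plays the role of the APR prediction
target = render_features(true_pose)    # plays the role of the query features
coarse_loss = feature_loss(coarse_pose, target)
refined_pose, refined_loss = refine_pose(coarse_pose, target)
```

In the paper the pose is a full 6-DoF camera pose, the rendered features come from NeFeS, and the target features come from a feature extractor applied to the query image; the toy keeps none of that machinery and only mirrors the optimization structure.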

Paper Structure (27 sections, 7 equations, 8 figures, 12 tables)

Figures (8)

  • Figure 1: Our pose refinement ($\mathcal{R}$) improves (coarse) pose predictions from other methods using novel feature synthesis to achieve pixel-wise alignment. Top left / right: 3D plots of predicted (green) and ground-truth (red) camera positions. Bottom left / right: alignment between rendered features and query image.
  • Figure 2: Illustration of the pose refinement pipeline. The query image is processed by a pose estimator $\mathcal{F}$, typically an absolute pose regressor, to obtain a coarse camera pose $\hat{P}$. Our novel feature synthesizer $\mathcal{N}$ renders a dense feature map $f^{rend}$ based on $\hat{P}$. Simultaneously, the feature extractor $\mathcal{G}$ extracts the feature map $f^{G}$ from the query image. We then compute the feature-metric error between $f^{rend}$ and $f^{G}$, denoted as $\mathcal{L}_{feature}$. This error is backpropagated to update either the parameters of $\mathcal{F}$ or the coarse pose $\hat{P}$ directly.
  • Figure 3: The architecture of our proposed NeFeS model. The query 3D position $\mathbf{x}$ is fed to the network after positional encoding $PE(\cdot)$. The network then splits into two heads: the static head and the transient head. Given a viewing direction $\mathbf{d}$, the rendered color map is generated by fusing the static RGB value $c^{(s)}_{i}$, the transient RGB value $c^{\tau}_{i}$, and their corresponding density values $\sigma^{(s)}_{i}$ and $\sigma^{\tau}_{i}$, while the rendered feature map is formed only by the static features $\mathbf{f}_{i}$ and density $\sigma^{(s)}_{i}$. In addition, the color map adopts exposure-adaptive ACT to compensate for exposure differences between images. The final feature map $\hat{\mathbf{F}}_{fusion}$ is the concatenation of rendered RGB and feature map processed by the feature fusion module.
  • Figure 4: Qualitative comparison between NeRFs trained with dSLAM GT poses (a) vs. SfM GT poses (b). As illustrated, the SfM NeRF (PSNR 19.94 dB) renders superior geometric details (bottom row) compared with the dSLAM NeRF (PSNR 16.11 dB).
  • Figure 5: Experiments on pose refinement bounds of our method in indoor and outdoor scenes. Each plot shows errors before (x-axis) and after (y-axis) refinement when the ground-truth pose is perturbed by varying magnitudes. The dashed green line is $y = x$. Points below this line indicate a reduction in pose error using our refinement method.
  • ...and 3 more figures
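
The Figure 3 caption says the color map fuses static and transient RGB and density, while the feature map uses only the static features and static density. The caption does not give the compositing rule, so the sketch below assumes a NeRF-W-style alpha compositing along a single ray. That rule, the function names, and the argument layout are assumptions made for illustration, not the paper's equations. Exposure-adaptive ACT and the feature fusion module are left out.

```python
import math


def alpha(sigma, delta):
    """Opacity of one ray segment with density sigma and length delta."""
    return 1.0 - math.exp(-sigma * delta)


def composite_ray(deltas, sigma_s, sigma_t, rgb_s, rgb_t, feats):
    """Composite one ray under the assumed NeRF-W-style rule.

    deltas  : segment lengths between consecutive samples
    sigma_s : static densities, sigma_t : transient densities
    rgb_s   : static RGB per sample, rgb_t : transient RGB per sample
    feats   : static feature vector per sample
    """
    color = [0.0, 0.0, 0.0]
    feature = [0.0] * len(feats[0])
    trans_all = 1.0     # transmittance through static + transient density
    trans_static = 1.0  # transmittance through static density only
    for i, delta in enumerate(deltas):
        a_s = alpha(sigma_s[i], delta)
        a_t = alpha(sigma_t[i], delta)
        # Color: both heads contribute, attenuated by the combined density.
        for ch in range(3):
            color[ch] += trans_all * (a_s * rgb_s[i][ch] + a_t * rgb_t[i][ch])
        # Feature: static head only, so transient content cannot leak in.
        for ch in range(len(feature)):
            feature[ch] += trans_static * a_s * feats[i][ch]
        trans_all *= math.exp(-(sigma_s[i] + sigma_t[i]) * delta)
        trans_static *= math.exp(-sigma_s[i] * delta)
    return color, feature


# Two samples on one ray, rendered with and without a transient occluder.
deltas = [0.5, 0.5]
sigma_s = [1.0, 2.0]
rgb_s = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
rgb_t = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
feats = [(0.3, 0.7), (0.9, 0.1)]
color_clean, feat_clean = composite_ray(deltas, sigma_s, [0.0, 0.0], rgb_s, rgb_t, feats)
color_occ, feat_occ = composite_ray(deltas, sigma_s, [3.0, 0.0], rgb_s, rgb_t, feats)
```

Under this assumed rule the transient occluder changes the rendered color but leaves the rendered feature untouched, which matches the caption's statement that the feature map is formed only from the static branch.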