Learning Neural Volumetric Pose Features for Camera Localization

Jingyu Lin; Jiaqi Gu; Bojian Wu; Lubin Fan; Renjie Chen; Ligang Liu; Jieping Ye


TL;DR

This work tackles camera localization by addressing the limitations of Absolute Pose Regression (APR) with a neural volumetric pose feature called PoseMap. PoseMap is learned by augmenting NeRF with a dedicated pose branch (NeRF-P), which is trained alongside an APRNet that regresses $SE(3)$ camera poses, producing discriminative pose representations and enabling novel-view synthesis for data augmentation. The authors also introduce a self-supervised online alignment mechanism that uses unlabelled images to further refine the pose features. The method reports average gains of 14.28% and 20.51% on indoor and outdoor benchmark scenes, with state-of-the-art accuracy among APR methods. Overall, PoseMap demonstrates that neural volumetric features can encode implicit pose information, enabling robust, data-efficient camera localization and paving the way for further integration with structure-based cues.

Abstract

We introduce a novel neural volumetric pose feature, termed PoseMap, designed to enhance camera localization by encapsulating the information between images and the associated camera poses. Our framework leverages an Absolute Pose Regression (APR) architecture, together with an augmented NeRF module. This integration not only facilitates the generation of novel views to enrich the training dataset but also enables the learning of effective pose features. Additionally, we extend our architecture for self-supervised online alignment, allowing our method to be used and fine-tuned for unlabelled images within a unified framework. Experiments demonstrate that our method achieves 14.28% and 20.51% performance gain on average in indoor and outdoor benchmark scenes, outperforming existing APR methods with state-of-the-art accuracy.
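The PoseMap is described as being produced by volumetric rendering of pose features from the augmented NeRF, but the rendering equation is not given here. The following is a minimal sketch, assuming the standard NeRF quadrature is applied to per-sample feature vectors instead of colors. The names `render_feature`, `sigmas`, `deltas`, and `features` are hypothetical and are not taken from the paper.

```python
import math

def render_feature(sigmas, deltas, features):
    """Composite per-sample feature vectors along one ray.

    Assumption: standard NeRF weights w_i = T_i * (1 - exp(-sigma_i * delta_i)),
    where T_i is the transmittance accumulated before sample i.
    """
    dim = len(features[0])
    rendered = [0.0] * dim
    weights = []
    transmittance = 1.0  # fraction of the ray that reaches the current sample
    for sigma, delta, feat in zip(sigmas, deltas, features):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        w = transmittance * alpha
        weights.append(w)
        for k in range(dim):
            rendered[k] += w * feat[k]
        transmittance *= 1.0 - alpha  # attenuate for the samples behind
    return rendered, weights

# Usage: three samples on one ray, each carrying a 2-D feature.
pixel_feature, ray_weights = render_feature(
    sigmas=[0.1, 2.0, 0.5],
    deltas=[0.5, 0.5, 0.5],
    features=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
)
```

Repeating this for every pixel's ray would yield a per-pixel feature map; how the paper's pose branch produces the per-sample features is not specified in this section.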


Paper Structure (25 sections, 5 equations, 8 figures, 7 tables)


Introduction
Related Work
Absolute Pose Regression (APR)
NeRF for APR
Positioning of Our Work
Method
Camera Localization with PoseMap
Neural Volumetric Features of PoseMap
Self-supervised Feature Alignment with PoseMap
Experiments
Datasets and Implementation Details
Evaluation on Datasets
Ablation Studies
Discussions
Conclusion
...and 10 more sections

Figures (8)

Figure 1: The generation of PoseMap. To capture the implicit pose characteristics, we enhance the original NeRF by introducing a unique pose embedding. Subsequently, we generate a PoseMap through volumetric rendering. We believe that the learned volumetric features integrate the implicit information of camera pose and can be used to improve the accuracy of camera localization tasks.
Figure 2: Overview of the camera localization pipeline with PoseMap. The training stage of our pipeline includes two main modules: APRNet, for camera pose regression and image feature extraction, and NeRF-P, for view synthesis and pose feature extraction. The inference stage, which uses only a simple APRNet for fast inference, is highlighted in the blue dotted box.
Figure 3: Self-supervised online feature alignment scheme. We keep the $\mathcal{L}_{image}$ and $\mathcal{L}_{posemap}$ in the self-supervised pipeline. This scheme is suitable for any unlabelled images or images from the internet without matching with the 3D SfM model.
Figure 4: Visual comparison of camera localization between DFNet$_{dm}$ (top) and our method (bottom) on the 7-Scenes dataset. For each plot, we show the ground truth camera trajectory in green and the estimated trajectory in red. The color bar under each plot shows rotation errors: yellow represents high rotation error, and blue represents low rotation error. Sequence names from left to right are: office-seq7, chess-seq3, fire-seq3 and kitchen-seq4.
Figure 5: Visualization of localization results. From left to right, we show the input real image, the image rendered at the pose estimated by PMNet$_{ud}$, the APR feature map, and our PoseMap. PCA dimensionality reduction is used to visualize the PoseMap in pseudo-color.
...and 3 more figures
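The Figure 4 caption color-codes rotation error along each trajectory, but the evaluation protocol is not stated in this section. As a rough sketch, assuming poses are given as 3x3 rotation matrices and 3-vector translations, the error measures commonly used for pose regression can be computed as below. The function names `rotation_error_deg` and `translation_error` are hypothetical.

```python
import math

def rotation_error_deg(R_est, R_gt):
    """Angle in degrees of the relative rotation between two 3x3 rotation matrices."""
    # trace(R_est^T @ R_gt) equals the sum of elementwise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth camera positions."""
    return math.dist(t_est, t_gt)

# Usage: identity versus a 90-degree rotation about the z-axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
angle = rotation_error_deg(identity, rot_z_90)
offset = translation_error([0.0, 0.0, 0.0], [3.0, 4.0, 0.0])
```

Benchmarks of this kind usually aggregate such per-frame errors over each scene; which aggregate the authors report is not specified in this section.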
