
WSCLoc: Weakly-Supervised Sparse-View Camera Relocalization

Jialu Wang, Kaichen Zhou, Andrew Markham, Niki Trigoni

Abstract

Despite advances in deep learning for camera relocalization, obtaining the ground-truth pose labels required for training remains costly. While current weakly-supervised methods excel at lightweight label generation, their performance declines notably in sparse-view scenarios. To address this challenge, we introduce WSCLoc, a system that can be adapted to various deep learning-based relocalization models to enhance their performance under weakly-supervised and sparse-view conditions. WSCLoc operates in two stages. In the first stage, it employs a multilayer perceptron-based structure called WFT-NeRF to co-optimize image reconstruction quality and initial pose information. To ensure a stable learning process, we incorporate temporal information as input. Furthermore, instead of optimizing SE(3), we opt for $\mathfrak{sim}(3)$ optimization to explicitly enforce a scale constraint. In the second stage, we co-optimize the pre-trained WFT-NeRF and WFT-Pose. This optimization is enhanced by Time-Encoding based Random View Synthesis (TE-based RVS) and supervised by inter-frame geometric constraints that consider pose, depth, and RGB information. We validate our approach on two publicly available datasets, one outdoor and one indoor. Our experimental results demonstrate that our weakly-supervised relocalization solutions achieve superior pose estimation accuracy in sparse-view scenarios, with results comparable to state-of-the-art camera relocalization methods. We will make our code publicly available.
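
To illustrate what optimizing a similarity transform rather than a rigid one adds, the sketch below builds a transform from seven parameters: three for rotation, three for translation, and one log-scale. It is a simplified parametrization written for this summary, not the authors' implementation: the translation is used directly rather than through the full $\mathfrak{sim}(3)$ exponential map, and the function names (`rodrigues`, `similarity_transform`, `apply_similarity`) are hypothetical.

```python
import math


def rodrigues(omega):
    """Convert an axis-angle vector (3 values) to a 3x3 rotation matrix."""
    wx, wy, wz = omega
    theta = math.sqrt(wx * wx + wy * wy + wz * wz)
    if theta < 1e-12:
        # No rotation: return the identity.
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = wx / theta, wy / theta, wz / theta
    # Skew-symmetric matrix of the unit rotation axis.
    K = [[0.0, -kz, ky], [kz, 0.0, -kx], [-ky, kx, 0.0]]
    K2 = [[sum(K[i][m] * K[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    s, c = math.sin(theta), math.cos(theta)
    # Rodrigues' formula: R = I + sin(theta) K + (1 - cos(theta)) K^2
    return [[(1.0 if i == j else 0.0) + s * K[i][j] + (1.0 - c) * K2[i][j]
             for j in range(3)] for i in range(3)]


def similarity_transform(params):
    """Split a 7-vector (omega[3], t[3], log_scale) into (scale, R, t).

    Assumption: the scale is stored as a logarithm so that it stays
    positive under unconstrained gradient updates.
    """
    omega, t, log_scale = params[0:3], params[3:6], params[6]
    return math.exp(log_scale), rodrigues(omega), list(t)


def apply_similarity(params, x):
    """Map a 3D point x to scale * R @ x + t."""
    scale, R, t = similarity_transform(params)
    return [scale * sum(R[i][j] * x[j] for j in range(3)) + t[i]
            for i in range(3)]


# Usage: rotate 90 degrees about z, double the scale, shift by 1 along x.
params = [0.0, 0.0, math.pi / 2, 1.0, 0.0, 0.0, math.log(2.0)]
point = apply_similarity(params, [1.0, 0.0, 0.0])
```

With the log-scale fixed at zero this reduces to a rigid SE(3) transform. The seventh parameter is what lets an optimizer correct a global scale error in the poses.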

Paper Structure (28 sections, 7 equations, 4 figures, 5 tables)


Figures (4)

  • Figure 1: WSCLoc System Workflow. In the WFT-NeRF Stage (left), time encodings are generated for each image, and initial pose labels are obtained during the simultaneous training of the WFT-NeRF model. In the WFT-Pose Stage (right), the training set is augmented using TE-based RVS (not shown in the figure). Consecutive frames are then fed into the target relocalization model in each iteration to calculate the pose loss and inter-frame geometric constraint loss. Finally, the relocalization model is trained by minimizing the overall loss.
  • Figure 2: Structure of WFT-NeRF. During video capture, reference images are encoded with temporal information ($t_i$) using discrete time indices to minimize motion-related blurring. Grayscale levels in YUV are encoded for consistent exposure and appearance. Our NeRF training involves three sets of MLPs: 1. The base network estimates volume density and hidden state after coarse ray sampling. 2. Middle MLPs perform fine-ray sampling for appearance, estimating density and color. 3. Top MLPs handle fine-ray sampling for transient properties, estimating density, color, and uncertainty to filter transient objects. Losses between the rendered and reference images optimize pose during backpropagation, simultaneously optimizing NeRF and $\mathfrak{sim}(3)$ poses. Only the base network is used for testing.
  • Figure 3: Qualitative Comparison. Large-scale or free trajectories can introduce camera rolling-shutter effects and motion blur, causing artifacts during NeRF model training and resulting in incorrect pose estimates. We mitigate this by explicitly encoding time indices for input images, rectifying deformed ground truth images, and enhancing robustness to motion blur with sharper object boundaries.
  • Figure 4: Qualitative Evaluation of WFT-NeRF Performance (Hospital scene in the large-scale Cambridge dataset under 10% sparse-view conditions). Here, we demonstrate how our full WFT-NeRF model benefits from the explicit scale constraint (SC) and Time Encoding (TE). Removing TE alone results in less clear image boundaries. Removing SC alone leads to blurry images due to noisy pose labels generated by SfM in sparse-view scenarios. Removing both TE and SC severely degrades the quality of rendered images.
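
The Figure 2 caption says each reference image is tagged with a discrete time index $t_i$. One plausible way to turn such an index into an MLP input is a NeRF-style sinusoidal encoding, sketched below. The exact encoding used by WFT-NeRF is not given in this summary, so the function name `time_encoding`, the normalization to $[0, 1]$, and the number of frequency bands are all assumptions.

```python
import math


def time_encoding(frame_index, num_frames, num_freqs=4):
    """Encode a discrete frame index as sinusoidal features.

    Assumption: the index is normalized to [0, 1] over the sequence and
    expanded with frequencies 2^k * pi, as in NeRF positional encoding.
    """
    t = frame_index / max(num_frames - 1, 1)
    features = [t]
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        features.append(math.sin(freq * t))
        features.append(math.cos(freq * t))
    return features


# Usage: encode the first and last frames of a 10-frame sequence.
first = time_encoding(0, 10)
last = time_encoding(9, 10)
```

Each frame yields `1 + 2 * num_freqs` values, which would be concatenated with the other per-image inputs. Nearby frames receive similar low-frequency features, which is one way temporal ordering can regularize training when views are sparse.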