UnLoc: Leveraging Depth Uncertainties for Floorplan Localization

Matthias Wüest, Francis Engelmann, Ondrej Miksik, Marc Pollefeys, Daniel Barath

Abstract

We propose UnLoc, an efficient data-driven solution for sequential camera localization within floorplans. Floorplan data is readily available, long-term persistent, and robust to changes in visual appearance. We address key limitations of recent methods, such as the lack of uncertainty modeling in depth predictions and the necessity for custom depth networks trained for each environment. We introduce a novel probabilistic model that incorporates uncertainty estimation, modeling depth predictions as explicit probability distributions. By leveraging off-the-shelf pre-trained monocular depth models, we eliminate the need to rely on per-environment-trained depth networks, enhancing generalization to unseen spaces. We evaluate UnLoc on large-scale synthetic and real-world datasets, demonstrating significant improvements over existing methods in terms of accuracy and robustness. Notably, we achieve $2.7$ times higher localization recall on long sequences (100 frames) and $42.2$ times higher on short ones (15 frames) than the state of the art on the challenging LaMAR HGE dataset.

Paper Structure

This paper contains 19 sections, 9 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: UnLoc processes an input image sequence to predict the floorplan depth (in meters) and associated uncertainty for each image column. Using these predictions, it generates a probability distribution over potential $\text{SE}(2)$ camera poses and outputs the most likely one (blue arrow). The ground truth pose is also shown (red arrow), overlapped by the predicted pose.
  • Figure 2: Main method overview. At timestep $t$, UnLoc aligns an image with gravity and processes it through a monodepth encoder. The extracted features, along with a binary mask from the gravity alignment, are used to predict the floorplan depth $\mathbf{\hat{d}}_t$ and uncertainty $\mathbf{\hat{b}}_t$ via masked attentions. These predictions form equiangular rays, allowing for uncertainty-aware matching with the floorplan's occupancy map. A histogram filter fuses the observation likelihood with the integrated past belief.
  • Figure 3: Floorplan depth predictions (in meters) for images from the LaMAR HGE dataset. Top: input images. Bottom: depth predictions by F$^3$Loc (red) and our proposed UnLoc (blue), with predicted uncertainties visualized. The horizontal axis represents the image column index, ranging from left (0) to right (image width $w$). A gray dotted line indicates the ground truth depth.
  • Figure 4: Efficiency analysis. Performance of sequential localization versus model size (left) and runtime (right) on an NVIDIA Quadro RTX 6000 GPU. The success rate (SR) is defined as the percentage of sequences of length $T=25$ for which the posterior remains within an error radius of $1$ m in the last 10 frames. The values are averaged over all test sequences from the Gibson(t) dataset.
  • Figure 5: Observation likelihoods.
  • ...and 1 more figure
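
The uncertainty-aware matching and histogram-filter fusion described in the Figure 2 caption can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes each ray's depth follows a Laplace distribution with location $\hat{d}$ and scale $\hat{b}$ (the captions do not name the distribution), it assumes the expected floorplan depths per candidate pose are already available from ray casting, and it omits the motion step of the filter. The function names `laplace_log_likelihood` and `histogram_filter_update` and all numeric values are hypothetical.

```python
import numpy as np

def laplace_log_likelihood(d_pred, b_pred, d_expected):
    """Sum of per-ray Laplace log-densities (assumed distribution).

    d_pred, b_pred: predicted depth and uncertainty per ray, shape (R,).
    d_expected: floorplan depth per candidate pose and ray, shape (P, R).
    Returns one log-likelihood per candidate pose, shape (P,).
    """
    d_pred = np.asarray(d_pred, dtype=float)
    b_pred = np.asarray(b_pred, dtype=float)
    d_expected = np.asarray(d_expected, dtype=float)
    per_ray = -np.log(2.0 * b_pred) - np.abs(d_expected - d_pred) / b_pred
    return per_ray.sum(axis=-1)

def histogram_filter_update(prior, log_likelihood):
    """Fuse the past belief with the new observation (Bayes rule on a grid)."""
    log_post = np.log(np.asarray(prior, dtype=float) + 1e-300) + log_likelihood
    log_post -= log_post.max()  # subtract the max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()

# Predicted depths (meters) for three rays; the third ray is very uncertain.
d_hat = [2.0, 3.0, 4.0]
b_hat = [0.1, 0.1, 2.0]

# Expected floorplan depths for three hypothetical candidate poses.
d_floorplan = [
    [2.5, 3.5, 4.0],  # candidate 0: off on both confident rays
    [2.0, 3.0, 6.0],  # candidate 1: off only on the uncertain ray
    [3.0, 3.0, 4.0],  # candidate 2: far off on one confident ray
]

prior = np.full(3, 1.0 / 3.0)  # uniform past belief
posterior = histogram_filter_update(
    prior, laplace_log_likelihood(d_hat, b_hat, d_floorplan)
)
best_pose = int(np.argmax(posterior))  # candidate 1
```

Under these assumptions, candidate 1 wins even though its third ray is off by 2 m, because the large predicted uncertainty on that ray shrinks its penalty. With equal uncertainties on all rays, the same 2 m error would dominate the score.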