F$^3$Loc: Fusion and Filtering for Floorplan Localization

Changan Chen, Rui Wang, Christoph Vogel, Marc Pollefeys

TL;DR

This work tackles indoor camera localization with respect to floorplans without requiring per-map retraining or large image databases. It combines monocular and multi-view floorplan depth predictions through a learned complementary selector and integrates evidence over time with an efficient SE(2) histogram filter, enabling real-time sequential localization on consumer hardware. Key contributions include a novel 1D ray floorplan representation, depth extraction from single and multi-view inputs, a learned fusion mechanism, virtual roll-pitch augmentation, an SE(2) histogram filter for rapid sequential inference, and a large iGibson-based dataset plus a real-world LaMAR demonstration showing scalable, accurate localization. The results show significant improvements in recall and localization speed over state-of-the-art methods, with practical implications for indoor AR/VR and robot autonomy.

Abstract

In this paper we propose an efficient data-driven solution to self-localization within a floorplan. Floorplan data is readily available, long-term persistent and inherently robust to changes in the visual appearance. Our method does not require retraining per map and location or demand a large database of images of the area of interest. We propose a novel probabilistic model consisting of an observation and a novel temporal filtering module. Operating internally with an efficient ray-based representation, the observation module consists of a single and a multiview module to predict horizontal depth from images and fuses their results to benefit from advantages offered by either methodology. Our method operates on conventional consumer hardware and overcomes a common limitation of competing methods that often demand upright images. Our full system meets real-time requirements, while outperforming the state-of-the-art by a significant margin.
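
The interplay between the observation module and the temporal filter can be illustrated with a generic histogram (discrete Bayes) filter. The sketch below is not the authors' implementation: it assumes a toy 1D circular grid of pose cells in place of the SE(2) volume, and the names `predict`, `update`, `kernel`, and `likelihood` are hypothetical.

```python
def predict(belief, kernel):
    """Motion step: spread each cell's probability according to a motion kernel.

    `kernel` maps a cell offset to its probability (assumed to sum to 1).
    The grid is treated as circular, which is a simplification for this toy.
    """
    n = len(belief)
    predicted = [0.0] * n
    for i, p in enumerate(belief):
        for offset, weight in kernel.items():
            predicted[(i + offset) % n] += p * weight
    return predicted


def update(belief, likelihood):
    """Observation step: multiply by the per-cell likelihood and renormalize."""
    posterior = [b * l for b, l in zip(belief, likelihood)]
    total = sum(posterior)
    if total == 0.0:
        raise ValueError("observation likelihood is zero for every cell")
    return [p / total for p in posterior]


# Toy example: four pose cells, uniform prior.
belief = [0.25, 0.25, 0.25, 0.25]
kernel = {0: 0.2, 1: 0.8}             # assumed ego-motion: mostly one cell forward
likelihood = [0.1, 0.1, 0.7, 0.1]     # assumed output of an observation model

belief = update(predict(belief, kernel), likelihood)
best_cell = belief.index(max(belief))  # most probable cell after one filter step
```

In the method described above, the belief would instead be a probability volume over position and orientation, and the likelihood would come from comparing predicted floorplan depth against the floorplan.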

Paper Structure (17 sections, 11 equations, 11 figures, 4 tables)


Figures (11)

  • Figure 1: Floorplan localization. We propose a novel probabilistic model for localization within a floorplan consisting of a data-driven observation module (a,b) and a temporal filtering module (c). Evidence is estimated as a 1D range image from a single RGB image (a) and from a few consecutive RGB images (b). A learned soft selection module combines the outputs of these complementary cues. The observation likelihood is integrated over time by an efficient SE(2) histogram filter to deliver the pose posterior. Our system achieves rapid and accurate sequential localization, outperforming the state-of-the-art in recall and localization speed, while operating on consumer hardware.
  • Figure 2: Pipeline overview. Our pipeline adopts a monocular (\ref{sec:mono}) and a multi-view network (\ref{sec:mv}) to predict floorplan depth. A selection network (\ref{sec:select}) consolidates both predictions based on the relative poses. The resulting floorplan depth is used in our observation model and integrated over time by our novel SE(2) histogram filter (\ref{sec:histogram}) to perform sequential floorplan localization.
  • Figure 3: Predicting and localizing with a single image. (a) A gravity-aligned image is fed into the ResNet- and attention-based feature network. Invisible pixels are masked out in the attention. The network outputs a probability distribution over depth hypotheses, and its expectation is used as the predicted floorplan depth. (b) Equiangular rays are interpolated from the predicted floorplan depth. We localize by finding the pose in the floorplan whose rays are most similar to the prediction.
  • Figure 4: Floorplan depth prediction from multiple views. Column features of the images are extracted and gathered in the reference frame. Their cross-view feature variance is used as cost. A U-Net-like network learns the cost filtering to form a probability distribution, and the floorplan depth is defined by its expectation.
  • Figure 5: Transition as grouped convolution. (a) Illustration of the translational filters (from left to right, top to bottom, the filters for $0^\circ, 10^\circ, \dots, 350^\circ$) and the rotational filters derived from a sample ego-motion. (b) The probability volume is divided into $O$ groups, where $O$ is the number of orientations. Each group is convolved with its respective translational filter and stacked back together. After circular padding along the orientation axis, the volume is convolved with the rotational filter to finish the transition step.
  • ...and 6 more figures
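
The two steps in the Figure 3 caption, taking the expectation over depth hypotheses and then picking the pose whose rays best match the prediction, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the mean absolute difference used as the ray similarity, the candidate-pose dictionary, and the function names `expected_depth`, `ray_distance`, and `localize` are all hypothetical.

```python
def expected_depth(probs, depth_hypotheses):
    """Expectation of a discrete distribution over depth hypotheses."""
    if len(probs) != len(depth_hypotheses):
        raise ValueError("one probability per depth hypothesis is required")
    total = sum(probs)
    return sum(p * d for p, d in zip(probs, depth_hypotheses)) / total


def ray_distance(rays_a, rays_b):
    """Mean absolute difference between two ray sets (assumed similarity measure)."""
    return sum(abs(a - b) for a, b in zip(rays_a, rays_b)) / len(rays_a)


def localize(predicted_rays, candidate_rays):
    """Return the candidate pose whose floorplan rays are closest to the prediction.

    `candidate_rays` maps a pose (x, y, heading in degrees) to ray lengths
    that would be cast from that pose in the floorplan.
    """
    return min(candidate_rays,
               key=lambda pose: ray_distance(predicted_rays, candidate_rays[pose]))


# Toy example: one image column with three depth hypotheses (in metres).
depth = expected_depth([0.2, 0.5, 0.3], [1.0, 2.0, 3.0])

# Toy floorplan with two candidate poses and three rays each.
candidates = {
    (0.0, 0.0, 0): [2.0, 2.0, 2.0],
    (1.0, 0.0, 90): [1.0, 3.0, 2.0],
}
pose = localize([1.1, 2.9, 2.0], candidates)
```

Using the expectation rather than the most likely hypothesis yields a continuous depth value even though the hypotheses are discrete.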