FlatLands: Generative Floormap Completion From a Single Egocentric View

Subhransu S. Bhattacharjee, Dylan Campbell, Rahul Shome

Abstract

A single egocentric image typically captures only a small portion of the floor, yet a complete metric traversability map of the surroundings would better serve applications such as indoor navigation. We introduce FlatLands, a dataset and benchmark for single-view bird's-eye view (BEV) floor completion. The dataset contains 270,575 observations from 17,656 real metric indoor scenes drawn from six existing datasets, with aligned observation, visibility, validity, and ground-truth BEV maps, and the benchmark includes both in- and out-of-distribution evaluation protocols. We compare training-free approaches, deterministic models, ensembles, and stochastic generative models. Finally, we instantiate the task as an end-to-end monocular RGB-to-floormaps pipeline. FlatLands provides a rigorous testbed for uncertainty-aware indoor mapping and generative completion for embodied navigation.

Paper Structure (83 sections, 8 equations, 19 figures, 16 tables)

Figures (19)

  • Figure 1: Pipeline. From a single RGB image, our model predicts depth and floor segmentation and projects them to BEV, producing observed floor $F_{\text{obs}}$ and unobserved mask $U$. A conditional generator then predicts floormap completions in the unobserved region, while preserving observed evidence.
  • Figure 2: FlatLands dataset statistics and construction.
  • Figure 3: Input egocentric RGB (left) and the four aligned $256{\times}256$ binary maps per observation. $F_{\text{obs}}$: observed floor; $U$: valid unobserved; $F^{\star}$: full floor ground truth; $V$: valid workspace. The white marker ($\blacktriangledown$) denotes the fixed camera anchor in BEV.
  • Figure 4: LaMa-Ensemble vs. FM+XAttn on a multi-room ScanNet scene. Row 1: observed floor $F_{\text{obs}}$ and unobserved mask $U$ condition both models; the four LaMa-Ensemble samples (boxed) and their per-pixel variance $\sigma^2$. Row 2: ground-truth floor $F^{\star}$ and validity mask $V$ used for evaluation; four FM+XAttn samples (boxed) and their $\sigma^2$. LaMa-Ensemble spreads variance uniformly; FM+XAttn concentrates it at layout boundaries.
  • Figure 5: Qualitative results on the test split. Top: deterministic single-output comparison across three scenes (in-distribution rows 1--2, out-of-distribution row 3). Columns show the observed floor $F_{\text{obs}}$, unobserved mask $U$, ground truth, and predictions from each baseline. These BEV observations are geometrically projected from the 3D mesh and do not involve any RGB input. Bottom: four independent samples, drawn from each stochastic generator for one in-distribution scene, alongside the per-pixel variance $\sigma^2$ (brighter $=$ higher disagreement).
  • ...and 14 more figures
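
The captions above mention two operations that a small example may make more concrete: completing the floor only inside the unobserved mask $U$ while preserving the observed floor $F_{\text{obs}}$ (Figure 1), and computing a per-pixel variance $\sigma^2$ across several samples (Figures 4 and 5). The sketch below is illustrative only and is not the authors' implementation. The function names `composite` and `per_pixel_variance`, the nested-list map representation, and the use of population variance are all assumptions made here for illustration.

```python
from statistics import pvariance

# A binary BEV map as nested lists of 0/1 (assumed representation;
# the dataset itself uses 256x256 maps).
Grid = list[list[int]]


def composite(f_obs: Grid, u: Grid, pred: Grid) -> Grid:
    """Keep observed floor; accept predictions only inside the unobserved mask U."""
    return [
        [1 if fo else (p if un else 0) for fo, un, p in zip(row_f, row_u, row_p)]
        for row_f, row_u, row_p in zip(f_obs, u, pred)
    ]


def per_pixel_variance(samples: list[Grid]) -> list[list[float]]:
    """Population variance of each pixel across samples (assumed estimator)."""
    height, width = len(samples[0]), len(samples[0][0])
    return [
        [pvariance([s[i][j] for s in samples]) for j in range(width)]
        for i in range(height)
    ]


# Toy 2x2 example with two hypothetical generator outputs.
f_obs = [[1, 0], [0, 0]]   # observed floor
u = [[0, 1], [1, 0]]       # valid unobserved region
raw_samples = [
    [[0, 1], [1, 1]],
    [[1, 0], [1, 1]],
]

completions = [composite(f_obs, u, s) for s in raw_samples]
variance = per_pixel_variance(completions)
```

In this toy case the two completions differ only at the top-right pixel, which lies inside $U$, so that is the only pixel with non-zero variance. Pixels outside $U$ never vary, because the compositing step overrides the generator there.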