Room Envelopes: A Synthetic Dataset for Indoor Layout Reconstruction from Images

Sam Bahrami, Dylan Campbell

TL;DR

This paper addresses the incompleteness of indoor scene reconstructions caused by occlusion. It proposes Room Envelopes, a synthetic dataset that provides two pointmaps per view: a visible-surface map and a first-layout (structural) surface map. This dual representation allows direct supervision of feed-forward monocular layout estimators, exploiting the planar and regular nature of room layouts. Using a model built on a MoGe backbone and comparing it against MoGe and LaRI, the authors demonstrate improved reconstruction of occluded layout geometry and show qualitative results that include in-the-wild images. The dataset and findings offer a practical route to more complete indoor geometry understanding, with potential applications in robotic navigation and augmented reality.

Abstract

Modern scene reconstruction methods are able to accurately recover 3D surfaces that are visible in one or more images. However, this leads to incomplete reconstructions, missing all occluded surfaces. While much progress has been made on reconstructing entire objects given partial observations using generative models, the structural elements of a scene, like the walls, floors and ceilings, have received less attention. We argue that these scene elements should be relatively easy to predict, since they are typically planar, repetitive and simple, and so less costly approaches may be suitable. In this work, we present a synthetic dataset -- Room Envelopes -- that facilitates progress on this task by providing a set of RGB images and two associated pointmaps for each image: one capturing the visible surface and one capturing the first surface once fittings and fixtures are removed, that is, the structural layout. As we show, this enables direct supervision for feed-forward monocular geometry estimators that predict both the first visible surface and the first layout surface. This confers an understanding of the scene's extent, as well as the shape and location of its objects.
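To make the dual-pointmap idea concrete, the following is a minimal sketch of how one might derive an "occluded layout" mask from the two pointmaps of a single view. It is not the authors' code: the function name `occluded_layout_mask`, the `(H, W, 3)` camera-frame array layout, the use of NaN to mark missing layout data, and the tolerance `rel_tol` are all assumptions made for illustration.

```python
import numpy as np

def occluded_layout_mask(visible_pts, layout_pts, rel_tol=1e-3):
    """Return a boolean (H, W) mask of pixels whose layout surface is hidden.

    visible_pts, layout_pts: (H, W, 3) pointmaps in the camera frame
    (an assumed storage layout). Missing layout data is assumed to be NaN.
    A pixel counts as occluded when its layout point lies farther along the
    ray than its visible point; where the two coincide, the layout surface
    is seen directly.
    """
    visible_range = np.linalg.norm(visible_pts, axis=-1)
    layout_range = np.linalg.norm(layout_pts, axis=-1)
    valid = np.isfinite(visible_range) & np.isfinite(layout_range)
    gap = np.where(valid, layout_range - visible_range, 0.0)
    return valid & (gap > rel_tol * visible_range)

# Toy 2x2 view: a wall 4 units away, with one pixel covered by furniture.
rays = np.array([[[0.0, 0.0, 1.0], [0.1, 0.0, 1.0]],
                 [[0.0, 0.1, 1.0], [0.1, 0.1, 1.0]]])
layout = rays * 4.0                 # structural surface at every pixel
visible = layout.copy()
visible[0, 1] = rays[0, 1] * 2.0    # furniture 2 units away hides the wall
layout[1, 1] = np.nan               # a hole: no layout data for this pixel
mask = occluded_layout_mask(visible, layout)
```

In this toy view only the furniture pixel is flagged; the pixel with missing layout data is excluded rather than counted as occluded.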

Paper Structure

This paper contains 14 sections, 5 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Room Envelopes dataset overview. Our synthetic dataset provides dual pointmap representations for indoor scene reconstruction. (a) Overhead floor plan showing a camera's view, with layout areas in red, the first visible surface in green, and regions shared by both in orange. (b) The visible surface, capturing all directly visible geometry including furniture and objects. (c) The layout surface, showing structural elements (walls, floors, ceilings, windows, doors) as they would appear without occlusion. This dual representation enables direct supervision for layout reconstruction in occluded regions. (d--f) Example data from our dataset. (d) The original RGB image from Hypersim. (e) The visible surface depth, capturing all visible surfaces including furniture and objects. (f) The first layout surface depth, showing only structural elements (walls, floors, ceilings, windows, doors).
  • Figure 2: Missing data and occlusion patterns. (a) Original RGB image showing furniture occluding wall surfaces. (b) Corresponding layout depth image with holes (missing data) where no camera view captured the structural elements.
  • Figure 3: Qualitative comparison of 3D first-surface layout results, shown as depth images. From left to right: input RGB image, ground truth layout geometry, our fine-tuned model trained on Room Envelopes, and LaRI [li2025lari]. Our method shows superior reconstruction of occluded layout elements. Red boxes highlight regions where our method successfully reconstructs layout geometry that is completely occluded in the input image.
  • Figure 4: Qualitative comparison of MoGe and our layout-trained model on in-the-wild indoor images captured by a phone camera. Normals are estimated using local surface fitting and colourised by mapping the x, y, z components to the red, green, blue channels respectively. Please zoom in to see finer details.
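
The normal visualisation described for Figure 4 can be sketched roughly as follows. The caption says only that normals come from local surface fitting and that x, y, z map to red, green, blue. This sketch swaps in a simpler estimator (the cross product of finite differences along the image axes) and assumes the common `(n + 1) / 2` mapping into `[0, 1]`. The function names and the camera-at-origin orientation rule are also assumptions.

```python
import numpy as np

def normals_from_pointmap(points):
    """Estimate per-pixel unit normals from an (H, W, 3) pointmap.

    Uses the cross product of finite differences along image rows and
    columns, a simple stand-in for local surface fitting.
    """
    d_row = np.gradient(points, axis=0)  # 3D change moving down one pixel
    d_col = np.gradient(points, axis=1)  # 3D change moving right one pixel
    normals = np.cross(d_col, d_row)
    length = np.linalg.norm(normals, axis=-1, keepdims=True)
    normals = normals / np.maximum(length, 1e-12)
    # Orient every normal towards the camera, assumed to sit at the origin.
    facing_away = np.sum(normals * points, axis=-1, keepdims=True) > 0
    return np.where(facing_away, -normals, normals)

def colourise_normals(normals):
    """Map normal components in [-1, 1] to RGB values in [0, 1]."""
    return np.clip((normals + 1.0) * 0.5, 0.0, 1.0)

# Toy example: a fronto-parallel plane two units in front of the camera.
ys, xs = np.mgrid[0:3, 0:4].astype(float)
plane = np.stack([xs, ys, np.full_like(xs, 2.0)], axis=-1)
plane_normals = normals_from_pointmap(plane)
plane_colours = colourise_normals(plane_normals)
```

Under these assumptions a flat wall facing the camera gets one uniform colour, so a change of colour in such a visualisation marks a change in surface orientation, for example at a wall-floor boundary.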