Deep Polycuboid Fitting for Compact 3D Representation of Indoor Scenes

Gahye Lee, Hyejeong Yoon, Jungeon Kim, Seungyong Lee

TL;DR

The paper introduces a deep learning framework that represents indoor scenes as compact polycuboids. A transformer first labels cuboid faces, and a graph neural network then infers the spatial relations among the detected faces. The faces are aggregated into polycuboid instances and reconstructed as rectilinear meshes, giving lightweight, editable scene representations. The networks are trained on a synthetic polycuboid dataset and generalize to real data from ScanNet, Replica, and iPhone captures, supporting applications such as virtual room tours and scene editing. The approach captures scene geometry with substantially fewer primitives than dense meshes, offering efficient, controllable scene abstractions for downstream AR/VR tasks.

Abstract

This paper presents a novel framework for compactly representing a 3D indoor scene using a set of polycuboids through a deep learning-based fitting method. Indoor scenes mainly consist of man-made objects, such as furniture, which often exhibit rectilinear geometry. This property allows indoor scenes to be represented using combinations of polycuboids, providing a compact representation that benefits downstream applications like furniture rearrangement. Our framework takes a noisy point cloud as input and first detects six types of cuboid faces using a transformer network. Then, a graph neural network is used to validate the spatial relationships of the detected faces to form potential polycuboids. Finally, each polycuboid instance is reconstructed by forming a set of boxes based on the aggregated face labels. To train our networks, we introduce a synthetic dataset encompassing a diverse range of cuboid and polycuboid shapes that reflect the characteristics of indoor scenes. Our framework generalizes well to real-world indoor scene datasets, including Replica, ScanNet, and scenes captured with an iPhone. The versatility of our method is demonstrated through practical applications, such as virtual room tours and scene editing.
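The face-relation idea in the abstract can be illustrated with a minimal sketch for a single cuboid. The label names (`+x`, `-x`, and so on) and the helper functions are assumptions made for illustration, not the paper's notation; the sketch only encodes the geometric fact that each of the six faces of a cuboid shares an edge with every face except its opposite.

```python
from itertools import combinations

# Hypothetical names for the six cuboid face types (not the paper's labels).
FACES = ["+x", "-x", "+y", "-y", "+z", "-z"]

def opposite(face):
    """Return the face on the other side of the cuboid along the same axis."""
    sign, axis = face[0], face[1]
    return ("-" if sign == "+" else "+") + axis

def cuboid_face_graph():
    """Build the adjacency graph of a single cuboid.

    Nodes are faces; two faces are connected when they share a cuboid edge,
    which holds for every pair except opposite faces.
    """
    return {
        frozenset((a, b))
        for a, b in combinations(FACES, 2)
        if opposite(a) != b
    }

edges = cuboid_face_graph()
print(len(edges))  # one graph edge per cuboid edge
```

A polycuboid would need at least one more relation type for concave edges, which this single-cuboid sketch does not model.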

Paper Structure

This paper contains 18 sections, 4 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Comparison of shape abstraction results on the Replica dataset. Compared to MonteBoxFinder (MBF; Ramamonjisoa et al., 2022), which generates many thin boxes covering only small portions of the input points, our method detects high-quality polycuboids that comprehensively abstract the overall shapes of objects.
  • Figure 2: Overall process of our polycuboid fitting framework. Initially, a transformer network estimates point-wise face labels and offsets that are used to detect face segments. Detected face segments are then aggregated into polycuboid instances using spatial relations predicted by a graph convolutional network. Finally, each polycuboid instance is reconstructed as a polycuboid mesh in a coarse or fine detail level.
  • Figure 3: Face labels and their spatial relationships in a cuboid, and their extension to a polycuboid. (a) A cuboid with color-coded face and edge components. (b) Spatial relationship graph of cuboid components, whose nodes and edges correspond to faces and edges of a cuboid, respectively. (c) A polycuboid with an additional adjacency type of concave edge marked in black.
  • Figure 4: Illustration of our polycuboid reconstruction process. From a 3D non-uniform grid fitted to a polycuboid instance, we select valid 3D boxes that are inside the polycuboid based on the inferred face labels.
  • Figure 5: Illustration of our indoor scene reconstruction pipeline. The input point cloud is first separated into layout and object points. These two point sets are independently reconstructed into polycuboid meshes using our polycuboid fitting method, which are then merged to produce the final output.
  • ...and 12 more figures
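The grid-based reconstruction described in the Figure 4 caption can be sketched as follows, under stated assumptions. The split coordinates along each axis are taken as given, and the inside test is a hand-written stand-in (`in_l_shape`, a hypothetical predicate for an L-shaped solid); the paper decides validity from the inferred face labels, which this sketch does not reproduce. All function names here are illustrative.

```python
from itertools import product

def grid_cells(xs, ys, zs):
    """Yield (min_corner, max_corner) for every cell of a non-uniform grid.

    xs, ys, zs are the split coordinates along each axis; spacing may vary.
    """
    axes = [sorted(set(a)) for a in (xs, ys, zs)]
    intervals = [list(zip(a[:-1], a[1:])) for a in axes]
    for (x0, x1), (y0, y1), (z0, z1) in product(*intervals):
        yield (x0, y0, z0), (x1, y1, z1)

def select_boxes(xs, ys, zs, is_inside):
    """Keep the grid cells whose center passes the inside test."""
    boxes = []
    for lo, hi in grid_cells(xs, ys, zs):
        center = tuple((a + b) / 2 for a, b in zip(lo, hi))
        if is_inside(center):
            boxes.append((lo, hi))
    return boxes

def in_l_shape(p):
    """Stand-in inside test (assumption): union of two axis-aligned boxes."""
    x, y, z = p
    in_base = 0 <= x <= 2 and 0 <= y <= 1 and 0 <= z <= 1
    in_back = 0 <= x <= 1 and 0 <= y <= 1 and 0 <= z <= 2
    return in_base or in_back

boxes = select_boxes([0, 1, 2], [0, 1], [0, 1, 2], in_l_shape)
for lo, hi in boxes:
    print(lo, hi)
```

In this toy case the grid has four cells and the L-shaped solid occupies three of them, so the polycuboid is represented by three boxes rather than a dense mesh.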