Deep Polycuboid Fitting for Compact 3D Representation of Indoor Scenes
Gahye Lee, Hyejeong Yoon, Jungeon Kim, Seungyong Lee
TL;DR
The paper introduces a deep learning framework that represents indoor scenes as compact sets of polycuboids. Given a noisy point cloud, a transformer first labels points with one of six cuboid face types, and a graph neural network then validates the spatial relations among the detected faces. Validated faces are aggregated into polycuboid instances and reconstructed as rectilinear meshes, yielding lightweight, editable scene representations. The networks are trained on a synthetic polycuboid dataset and generalize to real data from ScanNet, Replica, and iPhone captures, supporting applications such as virtual room tours and scene editing. The approach captures scene geometry faithfully with substantially fewer primitives than dense meshes, offering efficient, controllable scene abstractions for downstream AR/VR tasks.
Abstract
This paper presents a novel framework for compactly representing a 3D indoor scene using a set of polycuboids through a deep learning-based fitting method. Indoor scenes mainly consist of man-made objects, such as furniture, which often exhibit rectilinear geometry. This property allows indoor scenes to be represented using combinations of polycuboids, providing a compact representation that benefits downstream applications like furniture rearrangement. Our framework takes a noisy point cloud as input and first detects six types of cuboid faces using a transformer network. Then, a graph neural network is used to validate the spatial relationships of the detected faces to form potential polycuboids. Finally, each polycuboid instance is reconstructed by forming a set of boxes based on the aggregated face labels. To train our networks, we introduce a synthetic dataset encompassing a diverse range of cuboid and polycuboid shapes that reflect the characteristics of indoor scenes. Our framework generalizes well to real-world indoor scene datasets, including Replica, ScanNet, and scenes captured with an iPhone. The versatility of our method is demonstrated through practical applications, such as virtual room tours and scene editing.
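To make the final reconstruction step concrete, the following is a minimal sketch of how a single box could be recovered from points that already carry face labels. It is not the authors' implementation: the label convention in `FACE_TYPES`, the function name `fit_box_from_face_labels`, the axis-aligned assumption, and the use of a per-face median are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical label convention (an assumption, not from the paper):
# each of the six face types maps to an (axis, side) pair.
FACE_TYPES = {
    0: (0, "min"), 1: (0, "max"),  # faces perpendicular to x
    2: (1, "min"), 3: (1, "max"),  # faces perpendicular to y
    4: (2, "min"), 5: (2, "max"),  # faces perpendicular to z
}


def fit_box_from_face_labels(points, labels):
    """Estimate an axis-aligned box from points labeled with face types.

    For each face type that has points, the plane offset is the median of
    those points' coordinates along the face's axis, which tolerates
    moderate noise. A face type with no points falls back to the extent
    of the whole point set along that axis.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)

    # Fallback bounds: the extent of all points.
    box_min = points.min(axis=0)
    box_max = points.max(axis=0)

    for label, (axis, side) in FACE_TYPES.items():
        mask = labels == label
        if not mask.any():
            continue  # keep the fallback bound for this face
        offset = np.median(points[mask, axis])
        if side == "min":
            box_min[axis] = offset
        else:
            box_max[axis] = offset

    return box_min, box_max
```

A polycuboid would then be described by several such boxes, one per group of aggregated faces. How the faces are grouped is the job of the graph neural network stage and is not modeled in this sketch.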
