Robust Shape Fitting for 3D Scene Abstraction
Florian Kluger, Eric Brachmann, Michael Ying Yang, Bodo Rosenhahn
TL;DR
This work abstracts cluttered 3D scenes by decomposing them into cuboid primitives fitted to depth-derived features. It introduces occlusion-aware distances and a learning-guided RANSAC framework that uses multiple sets of sampling weights to recover multiple scene structures. Two cuboid solvers, a neural one for speed and a numerical one for precision, enable end-to-end training without cuboid annotations. The method supports RGB and depth inputs and is validated on NYU Depth v2 and Synthetic Metropolis Homographies; the experiments, including extensive ablations, show improved parsimony and competitive reconstruction fidelity. Overall, the approach yields meaningful, compact scene abstractions suitable for CAD-like modeling and layout estimation on real-world 2.5D data.
Abstract
Humans perceive and construct the world as an arrangement of simple parametric models. In particular, we can often describe man-made environments using volumetric primitives such as cuboids or cylinders. Inferring these primitives is important for attaining high-level, abstract scene descriptions. Previous approaches for primitive-based abstraction estimate shape parameters directly and are only able to reproduce simple objects. In contrast, we propose a robust estimator for primitive fitting, which meaningfully abstracts complex real-world environments using cuboids. A RANSAC estimator guided by a neural network fits these primitives to a depth map. We condition the network on previously detected parts of the scene, parsing it one part at a time. To obtain cuboids from single RGB images, we additionally optimise a depth estimation CNN end-to-end. Naively minimising point-to-primitive distances leads to large or spurious cuboids occluding parts of the scene. We thus propose an improved occlusion-aware distance metric that correctly handles opaque scenes. Furthermore, we present a neural-network-based cuboid solver which provides more parsimonious scene abstractions while also reducing inference time. The proposed algorithm does not require labour-intensive labels, such as cuboid annotations, for training. Results on the NYU Depth v2 dataset demonstrate that the proposed algorithm successfully abstracts cluttered real-world 3D scene layouts.
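To make the "naive" point-to-primitive distance concrete, the following is a minimal sketch under simplifying assumptions: the cuboid is axis-aligned, and the distance is the plain unsigned distance from each 3D point to the cuboid surface. It is not the occlusion-aware metric proposed in the paper, and the function names, the `threshold` value, and the inlier-counting step are hypothetical choices for illustration only.

```python
import numpy as np


def cuboid_surface_distance(points, center, half_extents):
    """Unsigned distance from each 3D point to the surface of an
    axis-aligned cuboid (illustrative assumption: no rotation)."""
    points = np.asarray(points, dtype=float)
    center = np.asarray(center, dtype=float)
    half_extents = np.asarray(half_extents, dtype=float)

    # Per-axis offset from the cuboid faces; positive means outside on that axis.
    q = np.abs(points - center) - half_extents
    # Distance for points outside the cuboid (zero for points inside).
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=1)
    # Negative depth below the nearest face for points inside (zero outside).
    inside = np.minimum(np.max(q, axis=1), 0.0)
    return np.abs(outside + inside)


def count_inliers(points, center, half_extents, threshold=0.05):
    """Number of points within `threshold` of the cuboid surface,
    as a RANSAC-style hypothesis score (hypothetical threshold)."""
    d = cuboid_surface_distance(points, center, half_extents)
    return int(np.sum(d < threshold))


if __name__ == "__main__":
    pts = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
    print(cuboid_surface_distance(pts, center=[0, 0, 0], half_extents=[1, 1, 1]))
    print(count_inliers(pts, center=[0, 0, 0], half_extents=[1, 1, 1]))
```

Because this distance ignores visibility, a large cuboid whose far faces happen to pass through many points can score well even when its near faces would hide other observed parts of the scene. That is the failure mode the abstract attributes to naive distance minimisation.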
