Robust Shape Fitting for 3D Scene Abstraction

Florian Kluger, Eric Brachmann, Michael Ying Yang, Bodo Rosenhahn

TL;DR

This work tackles robust 3D scene abstraction by decomposing cluttered environments into cuboid primitives fitted to depth-derived features. It introduces occlusion-aware distances and a robust, learning-guided RANSAC framework with multiple sampling weight sets to recover multiple scene structures, plus two cuboid solvers (a neural one for speed and a numerical one for precision) enabling end-to-end training without cuboid annotations. The method supports RGB and depth inputs and is validated on NYU Depth v2 and Synthetic Metropolis Homographies, showing improved parsimony and competitive reconstruction fidelity through extensive ablations. Overall, the approach yields meaningful, compact scene abstractions suitable for CAD-like modeling and layout estimation, advancing primitive-based scene understanding in real-world 2.5D data.

Abstract

Humans perceive and construct the world as an arrangement of simple parametric models. In particular, we can often describe man-made environments using volumetric primitives such as cuboids or cylinders. Inferring these primitives is important for attaining high-level, abstract scene descriptions. Previous approaches for primitive-based abstraction estimate shape parameters directly and are only able to reproduce simple objects. In contrast, we propose a robust estimator for primitive fitting, which meaningfully abstracts complex real-world environments using cuboids. A RANSAC estimator guided by a neural network fits these primitives to a depth map. We condition the network on previously detected parts of the scene, parsing it one-by-one. To obtain cuboids from single RGB images, we additionally optimise a depth estimation CNN end-to-end. Naively minimising point-to-primitive distances leads to large or spurious cuboids occluding parts of the scene. We thus propose an improved occlusion-aware distance metric correctly handling opaque scenes. Furthermore, we present a neural network based cuboid solver which provides more parsimonious scene abstractions while also reducing inference time. The proposed algorithm does not require labour-intensive labels, such as cuboid annotations, for training. Results on the NYU Depth v2 dataset demonstrate that the proposed algorithm successfully abstracts cluttered real-world 3D scene layouts.
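
The occlusion problem mentioned in the abstract can be illustrated with a small sketch. This is not the paper's occlusion-aware distance metric, whose definition is not given here; it only counts how many observed points a candidate box would hide from the camera. The assumptions are: an axis-aligned box (the paper fits general cuboids), a camera at the origin, and the hypothetical helper names `segment_hits_box` and `count_occluded`.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def segment_hits_box(origin: Vec3, target: Vec3, box_min: Vec3, box_max: Vec3,
                     eps: float = 1e-9) -> bool:
    """Return True if the line of sight origin -> target passes through the box.

    Slab method: intersect the segment's parameter range with the range in
    which it lies between each pair of axis-aligned box planes.
    """
    # Stop just short of the observed point so that a point lying on the
    # camera-facing surface of the box is not counted as hidden.
    t_near, t_far = 0.0, 1.0 - 1e-6
    for o, p, lo, hi in zip(origin, target, box_min, box_max):
        d = p - o
        if abs(d) < eps:
            # Parallel to this slab: the segment must already lie inside it.
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def count_occluded(points: Sequence[Vec3], box_min: Vec3, box_max: Vec3,
                   camera: Vec3 = (0.0, 0.0, 0.0)) -> int:
    """Count the points whose line of sight from the camera is blocked by the box."""
    return sum(segment_hits_box(camera, p, box_min, box_max) for p in points)


if __name__ == "__main__":
    # Three observed points on a wall at depth z = 2 (illustrative data).
    wall = [(0.0, 0.0, 2.0), (0.5, 0.0, 2.0), (-0.5, 0.0, 2.0)]
    # Box whose front face coincides with the wall: hides nothing.
    print(count_occluded(wall, (-1.0, -1.0, 2.0), (1.0, 1.0, 3.0)))  # 0
    # Box floating between camera and wall: hides every point.
    print(count_occluded(wall, (-1.0, -1.0, 1.0), (1.0, 1.0, 1.5)))  # 3
```

A plain point-to-surface distance would judge the two boxes by proximity alone. A count like this one is one simple way to penalise a hypothesis that blocks the view of observed points.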

Paper Structure (70 sections, 56 equations, 19 figures, 8 tables)

Figures (19)

  • Figure 1: Primitive-based Scene Abstractions: We parse images of real-world scenes (a) and generate abstractions of their 3D structure using cuboids (c). Our method captures scene structure more accurately than previous work using superquadrics (Paschalidou et al., 2019) (b).
  • Figure 2: Overview: Given observations $\mathcal{X}$ (RGB image), we predict 3D features $\mathcal{Y}$ (depth map) using a neural network with parameters $\mathbf{v}$. Conditioned on a state $\mathbf{s}$, a second neural network with parameters $\mathbf{w}$ predicts sampling weights $p(\mathbf{y}|\mathbf{s}; \mathbf{w}) \in \mathcal{Q}$ for each feature $\mathbf{y} \in \mathcal{Y}$. Using these weights, a RANSAC-based estimator samples minimal sets of features, and generates primitive (cuboid) hypotheses $\mathcal{H}$. It selects the best hypothesis $\mathbf{\hat{h}} \in \mathcal{H}$ and appends it to the set of previously recovered primitives $\mathcal{M}$. We update the state $\mathbf{s}$ based on $\mathcal{M}$ and repeat the process in order to recover all primitives step-by-step.
  • Figure 3: Occlusion: Given are a point cloud (✕), two cuboids (A and B) and a camera observing the scene. Cuboid A is a better fit since it does not occlude any points.
  • Figure 4: Sampling and Fitting: (1) We sample minimal sets of features $\mathcal{S} \subset \mathcal{Y}$ using sampling weights $\mathcal{Q}$ (Sec. \ref{par:sampling}). (2) The solver $f_h$ (Sec. \ref{subsubsec:fitting}-\ref{subsubsec:fitting_neural}, Fig. \ref{fig:solvers}) computes cuboid parameters $\mathbf{h}$ from $\mathcal{S}$. (3) We compute multiple cuboid hypotheses concurrently, resulting in a set of hypotheses $\mathcal{H}$. (4) Using occlusion-aware inlier counting (Sec. \ref{subsec:occlusion_inlier}), we select the best hypothesis $\mathbf{\hat{h}}$ and (5) add it to the output set of recovered cuboids $\mathcal{M}$.
  • Figure 5: Sampling Weights: We predict multiple sets of sampling weights $\mathcal{Q} = \{\mathbf{p}_1, \dots, \mathbf{p}_Q\}$ and corresponding selection probabilities $\mathbf{q} = [q_1, \dots, q_Q]$. In this example, the first three sampling weight sets roughly cover distinct parts of the scene. The fourth set $\mathbf{p}_4$ does not, but also has the lowest selection probability.
  • ...and 14 more figures
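
The two-stage draw described in the captions of Figures 4 and 5 can be sketched as follows: first pick one weight set $\mathbf{p}_i$ with probability $q_i$, then sample a minimal set of features according to that set's weights. In the paper the weights are predicted by a neural network. Here they are plain lists, the function name `sample_minimal_set` is hypothetical, and drawing without replacement by zeroing out chosen weights is an assumption about the sampling details, not something the captions state.

```python
import random
from typing import List, Sequence, Tuple


def sample_minimal_set(weight_sets: Sequence[Sequence[float]],
                       selection_probs: Sequence[float],
                       set_size: int,
                       rng: random.Random) -> Tuple[int, List[int]]:
    """Choose a weight set by q, then draw `set_size` distinct feature indices by p."""
    # Stage 1: select which weight set p_i to use, with probability q_i.
    (chosen,) = rng.choices(range(len(weight_sets)), weights=selection_probs, k=1)
    weights = list(weight_sets[chosen])

    if sum(1 for w in weights if w > 0) < set_size:
        raise ValueError("not enough features with non-zero weight")

    # Stage 2: weighted sampling without replacement (assumed detail).
    indices: List[int] = []
    for _ in range(set_size):
        (idx,) = rng.choices(range(len(weights)), weights=weights, k=1)
        indices.append(idx)
        weights[idx] = 0.0  # a feature can appear only once in a minimal set
    return chosen, indices


if __name__ == "__main__":
    # Two illustrative weight sets over six features, each covering one half.
    p = [
        [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.2, 0.5, 0.3],
    ]
    q = [0.7, 0.3]
    rng = random.Random(0)
    for _ in range(3):
        print(sample_minimal_set(p, q, set_size=2, rng=rng))
```

Each returned minimal set would then be passed to a cuboid solver to produce one hypothesis. A weight set with a low selection probability, like $\mathbf{p}_4$ in Figure 5, is simply chosen less often.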