V-PRISM: Probabilistic Mapping of Unknown Tabletop Scenes

Herbert Wright, Weiming Zhi, Matthew Johnson-Roberson, Tucker Hermans

TL;DR

V-PRISM casts 3D tabletop scene reconstruction as a multi-class probabilistic mapping problem $m:\mathbb{R}^d\to[0,1]^c$, using a softmax Bayesian Hilbert map with hinge-point features. It uses EM-based variational inference over a weight matrix $\boldsymbol W$ and a novel object-centric negative sampling strategy to learn a posterior over classes, yielding both accurate geometry and principled per-voxel uncertainty through the expected prediction $\mathbb{E}_{\boldsymbol W}[\text{softmax}(\boldsymbol W\boldsymbol\phi(\boldsymbol x))]$. The system is evaluated on procedurally generated and real-world tabletop scenes, where it outperforms a voxel baseline and PointSDF on IoU and Chamfer metrics while providing uncertainty maps that highlight occluded regions. This uncertainty-aware representation supports safer, more reliable manipulation and planning in environments with unseen objects, and paves the way for active learning and dynamic tabletop scene handling.
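As a rough illustration of the expected prediction $\mathbb{E}_{\boldsymbol W}[\text{softmax}(\boldsymbol W\boldsymbol\phi(\boldsymbol x))]$, the sketch below estimates it by Monte Carlo sampling. Several details are assumptions made for illustration and are not taken from the paper: the RBF form of the hinge-point features, the `gamma` length-scale parameter, the appended bias feature, the diagonal Gaussian posterior over $\boldsymbol W$, the toy weights, and the use of entropy as an uncertainty summary.

```python
import numpy as np

def hinge_features(x, hinge_points, gamma=10.0):
    """RBF features of queries x (n, d) w.r.t. hinge points (h, d), plus a bias column.

    The RBF form and gamma are illustrative assumptions.
    """
    sq_dists = ((x[:, None, :] - hinge_points[None, :, :]) ** 2).sum(axis=-1)
    feats = np.exp(-gamma * sq_dists)
    return np.concatenate([feats, np.ones((x.shape[0], 1))], axis=1)

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def expected_class_probs(x, hinge_points, w_mean, w_std, n_samples=256, seed=0):
    """Monte Carlo estimate of E_W[softmax(W phi(x))], W ~ N(w_mean, diag(w_std^2))."""
    rng = np.random.default_rng(seed)
    phi = hinge_features(x, hinge_points)               # (n, h + 1)
    eps = rng.standard_normal((n_samples,) + w_mean.shape)
    w_samples = w_mean + w_std * eps                    # (s, c, h + 1)
    logits = np.einsum("sch,nh->snc", w_samples, phi)   # (s, n, c)
    return softmax(logits).mean(axis=0)                 # (n, c)

def entropy(probs):
    """Shannon entropy of each row; one possible per-point uncertainty summary."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

# Toy setup (hypothetical values): two objects, each tied to one hinge point,
# plus a free-space class driven only by the bias feature.
hinge_points = np.array([[0.0, 0.0, 0.0], [0.2, 0.0, 0.0]])
w_mean = np.array([[3.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0],
                   [0.0, 0.0, 1.0]])
w_std = np.full_like(w_mean, 0.3)
queries = np.array([[0.0, 0.0, 0.0], [0.2, 0.0, 0.0], [1.0, 1.0, 1.0]])

probs = expected_class_probs(queries, hinge_points, w_mean, w_std)
uncertainty = entropy(probs)
```

Each row of `probs` is a distribution over the classes at one query point, so the map takes values in $[0,1]^c$ as in the problem statement. Averaging the softmax over weight samples, rather than applying it to the mean weights, is what lets posterior variance show up in the prediction.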

Abstract

The ability to construct concise scene representations from sensor input is central to the field of robotics. This paper addresses the problem of robustly creating a 3D representation of a tabletop scene from a segmented RGB-D image. These representations are critical for a range of downstream manipulation tasks. Many previous attempts to tackle this problem do not capture accurate uncertainty, which is required to subsequently produce safe motion plans. In this paper, we cast the representation of 3D tabletop scenes as a multi-class classification problem. To tackle this, we introduce V-PRISM, a framework and method for robustly creating probabilistic 3D segmentation maps of tabletop scenes. Our maps contain occupancy estimates, segmentation information, and principled uncertainty measures. We evaluate the robustness of our method in (1) procedurally generated scenes using open-source object datasets, and (2) real-world tabletop data collected from a depth camera. Our experiments show that our approach outperforms alternative continuous reconstruction approaches that do not explicitly reason about objects in a multi-class formulation.

Paper Structure

This paper contains 16 sections, 1 theorem, 18 equations, 6 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

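From Bouchard [bouchard2007efficient]. Let $\mathbf z \in \mathbb R^{c}$, $\alpha \in \mathbb R$, and $\xi \in \mathbb R_+^{c}$. Then the following inequality holds:

$$\log \sum_{k=1}^{c} e^{z_k} \le \alpha + \sum_{k=1}^{c} \left[ \frac{z_k - \alpha - \xi_k}{2} + \lambda(\xi_k)\left((z_k - \alpha)^2 - \xi_k^2\right) + \log\left(1 + e^{\xi_k}\right) \right],$$

where $\lambda(\xi_k) = \frac{1}{2\xi_k}\left[\frac{1}{1 + \exp(-\xi_k)} - \frac{1}{2}\right]$.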
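A quick numerical check can make the role of $\alpha$ and $\xi$ concrete. The sketch below assumes the standard form of Bouchard's upper bound on log-sum-exp, written out in the docstring of `bouchard_bound`, and compares it with the exact value on random inputs. The function names, the sampling ranges, and the number of trials are illustrative choices.

```python
import numpy as np

def lam(xi):
    """lambda(xi) = (sigmoid(xi) - 1/2) / (2 xi), using the limit 1/8 as xi -> 0."""
    xi = np.asarray(xi, dtype=float)
    safe = np.where(xi > 1e-8, xi, 1.0)   # avoid division by zero
    val = (1.0 / (1.0 + np.exp(-safe)) - 0.5) / (2.0 * safe)
    return np.where(xi > 1e-8, val, 0.125)

def log_sum_exp(z):
    """Numerically stable log(sum_k exp(z_k))."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

def bouchard_bound(z, alpha, xi):
    """Assumed form of the bound:

    alpha + sum_k [ (z_k - alpha - xi_k) / 2
                    + lam(xi_k) * ((z_k - alpha)^2 - xi_k^2)
                    + log(1 + exp(xi_k)) ]
    """
    d = z - alpha
    terms = (d - xi) / 2.0 + lam(xi) * (d ** 2 - xi ** 2) + np.log1p(np.exp(xi))
    return alpha + np.sum(terms)

rng = np.random.default_rng(0)
gaps = []
for _ in range(1000):
    c = int(rng.integers(2, 8))           # number of classes
    z = 2.0 * rng.standard_normal(c)      # logits
    alpha = float(rng.standard_normal())  # free scalar parameter
    xi = rng.uniform(0.0, 3.0, size=c)    # free non-negative parameters
    gaps.append(bouchard_bound(z, alpha, xi) - log_sum_exp(z))

min_gap = min(gaps)   # should never be negative if the bound holds
```

Because the right-hand side is quadratic in $\mathbf z$, a bound of this kind is a common way to make the softmax likelihood tractable under a Gaussian posterior, with $\alpha$ and $\xi$ acting as free parameters that can be tuned to tighten it.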

Figures (6)

  • Figure 1: Our method takes a segmented (top left) point cloud observation (top right) and builds a continuous probabilistic map. This map can be used to reconstruct the scene (bottom left) or measure uncertainty about the scene (bottom right). The heat map shows uncertainty in a 2D slice parallel to the table plane. Uncertainty is high in occluded areas.
  • Figure 2: Running a separate sigmoid model per object can cause unwanted intersections between the reconstructions (circled). Our multi-class formulation uses a softmax model that avoids this problem.
  • Figure 3: Overview of our method, V-PRISM. We take a segmented point cloud and output a probabilistic segmentation map over 3D space that can be used for both object reconstruction and principled uncertainty. Our method first generates negative samples and hinge points, then uses these to create an augmented dataset. Then the probabilistic map is constructed by running an EM algorithm over this dataset.
  • Figure 4: Overview of our sampling method. 1. We perform stratified sampling along camera rays within $r_\text{obj}$ of the object. 2. Points are sampled below the table within $r_\text{obj}$ of the object. 3. Grid subsampling is performed.
  • Figure 5: Qualitative comparison of uncertainty. Top row: the observed point cloud with a green plane corresponding to the 2D slice where the heat maps were calculated. We compare a non-probabilistic variant of V-PRISM trained with gradient descent (middle row) and our method (bottom row). In the heat maps, the bottom is closer to the camera and the top is farther from the camera. Lighter areas correspond to more uncertainty. Our method predicts high uncertainty in occluded areas of the scene.
  • ...and 1 more figure
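The three sampling steps named in the Figure 4 caption (stratified sampling along camera rays, sampling below the table, and grid subsampling, all restricted to within $r_\text{obj}$ of the object) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the table is taken to be the plane `z = table_z`, the slab thickness `depth`, the bin count, the cell size, and every function name are hypothetical.

```python
import numpy as np

def stratified_ray_samples(camera, points, n_bins, rng):
    """One uniform sample per depth bin along each ray from the camera to an observed point."""
    n = points.shape[0]
    lower = np.arange(n_bins) / n_bins                        # lower edge of each bin
    t = lower[None, :] + rng.uniform(0.0, 1.0 / n_bins, (n, n_bins))
    samples = camera + t[:, :, None] * (points - camera)[:, None, :]
    return samples.reshape(-1, 3)

def within_radius(samples, object_points, r_obj):
    """Keep samples whose nearest object point is at most r_obj away."""
    dists = np.linalg.norm(samples[:, None, :] - object_points[None, :, :], axis=-1)
    return samples[dists.min(axis=1) <= r_obj]

def below_table_samples(object_points, table_z, r_obj, n, rng, depth=0.05):
    """Uniform samples in a thin slab under the table (assumed plane z = table_z)."""
    lo = object_points[:, :2].min(axis=0) - r_obj
    hi = object_points[:, :2].max(axis=0) + r_obj
    xy = rng.uniform(lo, hi, (n, 2))
    z = rng.uniform(table_z - depth, table_z, (n, 1))
    return within_radius(np.hstack([xy, z]), object_points, r_obj)

def grid_subsample(points, cell):
    """Keep one point per occupied grid cell of side length `cell`."""
    keys = np.floor(points / cell).astype(np.int64)
    _, first = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(first)]

# Toy scene (hypothetical values): a small object resting on the table.
rng = np.random.default_rng(0)
camera = np.array([0.0, 0.0, 1.0])
object_points = np.array([0.5, 0.0, 0.0]) + rng.uniform(0.0, 0.05, (50, 3))
r_obj, cell = 0.1, 0.02

ray_negatives = within_radius(
    stratified_ray_samples(camera, object_points, n_bins=8, rng=rng),
    object_points, r_obj)
table_negatives = below_table_samples(object_points, 0.0, r_obj, n=200, rng=rng)
negatives = grid_subsample(np.vstack([ray_negatives, table_negatives]), cell)
```

Restricting samples to a radius around the object concentrates the negative examples where they constrain the object boundary, and grid subsampling keeps their density roughly uniform so that no region dominates the augmented dataset.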

Theorems & Definitions (1)

  • Theorem 1