V-PRISM: Probabilistic Mapping of Unknown Tabletop Scenes
Herbert Wright, Weiming Zhi, Matthew Johnson-Roberson, Tucker Hermans
TL;DR
V-PRISM casts 3D tabletop scene reconstruction as a multiclass probabilistic mapping problem $m:\mathbb{R}^d\to[0,1]^c$, modeled with a softmax Bayesian Hilbert map over hinge-point features $\boldsymbol\phi(\boldsymbol x)$. It uses EM-based variational inference over a weight matrix $\boldsymbol W$, together with a novel object-centric negative sampling strategy, to learn a posterior over classes. The resulting map provides both accurate geometry and principled uncertainty at each query point through the expected class probabilities $\mathbb{E}_{\boldsymbol W}[\text{softmax}(\boldsymbol W\boldsymbol\phi(\boldsymbol x))]$. On procedurally generated and real-world tabletop scenes, the method outperforms a voxel baseline and PointSDF on IoU and Chamfer metrics, and its uncertainty maps highlight occluded regions. This uncertainty-aware representation supports safer, more reliable manipulation and planning in environments with unseen objects, and it opens a path toward active learning and dynamic tabletop scenes.
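To make the expectation $\mathbb{E}_{\boldsymbol W}[\text{softmax}(\boldsymbol W\boldsymbol\phi(\boldsymbol x))]$ concrete, the following is a minimal sketch, not the authors' implementation. It assumes Gaussian RBF hinge-point features with a bias term, a diagonal Gaussian posterior over $\boldsymbol W$, and a plain Monte Carlo estimate of the expectation. The function names (`hinge_features`, `expected_class_probs`) and the parameters `gamma`, `w_mean`, `w_std`, and `n_samples` are hypothetical and chosen only for illustration.

```python
import numpy as np


def hinge_features(x, hinge_points, gamma=10.0):
    """Assumed feature map phi(x): Gaussian RBFs at hinge points plus a bias.

    x: (n, d) query points, hinge_points: (h, d). Returns (n, h + 1).
    """
    sq_dists = ((x[:, None, :] - hinge_points[None, :, :]) ** 2).sum(axis=-1)
    rbf = np.exp(-gamma * sq_dists)
    return np.concatenate([rbf, np.ones((x.shape[0], 1))], axis=1)


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def expected_class_probs(x, hinge_points, w_mean, w_std, n_samples=200, seed=0):
    """Monte Carlo estimate of E_W[softmax(W phi(x))].

    Assumes (for illustration only) W ~ N(w_mean, diag(w_std**2)),
    with w_mean and w_std of shape (c, h + 1). Returns (n, c).
    """
    rng = np.random.default_rng(seed)
    phi = hinge_features(x, hinge_points)                      # (n, m)
    noise = rng.standard_normal((n_samples,) + w_mean.shape)   # (s, c, m)
    w_samples = w_mean + w_std * noise                         # (s, c, m)
    logits = np.einsum("scm,nm->snc", w_samples, phi)          # (s, n, c)
    return softmax(logits).mean(axis=0)                        # (n, c)


if __name__ == "__main__":
    hinge_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
    queries = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]])
    w_mean = np.zeros((3, 3))  # 3 classes, 2 hinge points + bias
    w_std = np.ones((3, 3))
    print(expected_class_probs(queries, hinge_points, w_mean, w_std))
```

Each row of the returned array is a class distribution for one query point. Under these assumptions, a larger posterior spread in `w_std` pulls the averaged probabilities toward less confident values, which is the kind of behavior the uncertainty maps rely on in occluded regions.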
Abstract
The ability to construct concise scene representations from sensor input is central to the field of robotics. This paper addresses the problem of robustly creating a 3D representation of a tabletop scene from a segmented RGB-D image. These representations are critical for a range of downstream manipulation tasks. Many previous attempts to tackle this problem do not capture accurate uncertainty, which is required to subsequently produce safe motion plans. In this paper, we cast the representation of 3D tabletop scenes as a multi-class classification problem. To tackle this, we introduce V-PRISM, a framework and method for robustly creating probabilistic 3D segmentation maps of tabletop scenes. Our maps contain occupancy estimates, segmentation information, and principled uncertainty measures. We evaluate the robustness of our method in (1) procedurally generated scenes using open-source object datasets, and (2) real-world tabletop data collected from a depth camera. Our experiments show that our approach outperforms alternative continuous reconstruction approaches that do not explicitly reason about objects in a multi-class formulation.
