SuperPrimitive: Scene Reconstruction at a Primitive Level
Kirill Mazur, Gwangbin Bae, Andrew J. Davison
TL;DR
This paper introduces SuperPrimitives, a primitive-level image representation that combines single-image priors with multi-view optimization for monocular scene reconstruction. Each SuperPrimitive is an image region whose depth map is known only up to scale: its depth is $D = s \mathfrak{D}$, where $\mathfrak{D}$ is the unscaled depth and the scale $s$ is optimized per primitive, jointly with the camera poses, to satisfy photometric consistency across views. The front-end uses Segment Anything for segmentation and a surface-normal predictor to generate local geometry; the back-end performs primitive-based two-view SfM and monocular VO. Together they enable zero-shot depth completion, few-view SfM, and robust VO on diverse datasets. Experiments on VOID, ScanNet, and TUM demonstrate competitive depth completion without training, efficient multi-view depth estimation from few views, and superior monocular odometry performance, supporting the case for a primitive-level representation in monocular reconstruction.
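The relation $D = s \mathfrak{D}$ can be illustrated with a minimal sketch of the depth-completion setting, where a few sparse depth measurements fix the scale of each primitive. This is an illustration under stated assumptions, not the authors' implementation: the function name `fit_primitive_scales`, the array layout, and the closed-form least-squares fit are all hypothetical choices made here, and the paper's own optimizer may differ.

```python
import numpy as np

def fit_primitive_scales(labels, unscaled_depth, sparse_depth):
    """Fit one scale s per primitive so that s * unscaled_depth matches sparse depth.

    labels:         (H, W) int array, primitive id per pixel (assumed segmentation output)
    unscaled_depth: (H, W) float array, depth known up to a per-primitive scale
    sparse_depth:   (H, W) float array, measured depth where > 0, 0 where missing
    """
    scales = {}
    dense = np.full(unscaled_depth.shape, np.nan)
    valid = sparse_depth > 0
    for pid in np.unique(labels):
        region = labels == pid
        observed = region & valid
        if not observed.any():
            continue  # no measurement: the scale is unobservable from this cue alone
        d = unscaled_depth[observed]
        z = sparse_depth[observed]
        # Closed-form least squares: argmin_s sum_i (s * d_i - z_i)^2
        s = float(d @ z / (d @ d))
        scales[int(pid)] = s
        dense[region] = s * unscaled_depth[region]  # D = s * unscaled depth
    return scales, dense

# Toy example: two primitives, one sparse measurement each.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
unscaled = np.array([[1.0, 1.1, 2.0, 2.2],
                     [1.0, 1.1, 2.0, 2.2]])
sparse = np.zeros((2, 4))
sparse[0, 0] = 2.0  # measurement inside primitive 0
sparse[1, 3] = 1.1  # measurement inside primitive 1
scales, dense = fit_primitive_scales(labels, unscaled, sparse)
```

In this toy case the recovered scales are approximately 2.0 for primitive 0 and 0.5 for primitive 1, and a single measurement per primitive is enough to densify the whole region. Primitives with no measurement stay undetermined (NaN here), which is where multi-view photometric consistency would be needed to constrain the scale.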
Abstract
Joint camera pose and dense geometry estimation from a set of images or a monocular video remains a challenging problem due to its computational complexity and inherent visual ambiguities. Most dense incremental reconstruction systems operate directly on image pixels and solve for their 3D positions using multi-view geometry cues. Such pixel-level approaches suffer from ambiguities or violations of multi-view consistency (e.g. caused by textureless or specular surfaces). We address this issue with a new image representation which we call a SuperPrimitive. SuperPrimitives are obtained by splitting images into semantically correlated local regions and enhancing them with estimated surface normal directions, both of which are predicted by state-of-the-art single image neural networks. This provides a local geometry estimate per SuperPrimitive, while their relative positions are adjusted based on multi-view observations. We demonstrate the versatility of our new representation by addressing three 3D reconstruction tasks: depth completion, few-view structure from motion, and monocular dense visual odometry.
