SuperPrimitive: Scene Reconstruction at a Primitive Level

Kirill Mazur, Gwangbin Bae, Andrew J. Davison

TL;DR

This paper introduces SuperPrimitives, a primitive-level image representation that combines single-image priors with multi-view optimisation for monocular scene reconstruction. Each SuperPrimitive is an image region with an unscaled depth map $\mathfrak{D}$; its scaled depth is $D = s \mathfrak{D}$, where the depth scale $s$ is estimated per primitive and optimised jointly with the camera poses to satisfy photometric consistency across views. The front-end uses Segment Anything for segmentation and a surface-normal predictor to generate local geometry. The back-end performs primitive-based two-view structure from motion (SfM) and monocular visual odometry (VO), enabling zero-shot depth completion, few-view SfM, and robust VO on diverse datasets. Experiments on VOID, ScanNet, and TUM demonstrate competitive depth completion without training, efficient multi-view depth estimation from few views, and superior monocular odometry performance, highlighting the practical impact of a primitive-level representation in monocular reconstruction.
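
The relation $D = s \mathfrak{D}$ can be illustrated with a minimal NumPy sketch. It assumes each pixel carries an integer primitive index, and it fits each scale to sparse depth measurements with a closed-form least-squares solution. The function names, the `0 = missing` convention, and the least-squares fit are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def compose_depth(unscaled_depth, segment_ids, scales):
    """Apply D(p) = s_k * unscaled_depth(p) for every pixel p in primitive k."""
    return scales[segment_ids] * unscaled_depth


def fit_scales_to_sparse_depth(unscaled_depth, segment_ids, sparse_depth, num_segments):
    """Hypothetical helper: per-primitive least-squares scale from sparse depth.

    Pixels where sparse_depth == 0 are treated as missing (an assumed convention).
    For primitive k the minimiser of sum (s * u - z)^2 is sum(u * z) / sum(u * u).
    Primitives without any measurement keep a placeholder scale of 1.0.
    """
    valid = sparse_depth > 0
    ids = segment_ids[valid]
    u = unscaled_depth[valid]
    z = sparse_depth[valid]
    num = np.bincount(ids, weights=u * z, minlength=num_segments)
    den = np.bincount(ids, weights=u * u, minlength=num_segments)
    scales = np.ones(num_segments)
    np.divide(num, den, out=scales, where=den > 0)
    return scales


if __name__ == "__main__":
    unscaled = np.array([[1.0, 2.0], [1.0, 4.0]])   # unscaled depth per pixel
    segments = np.array([[0, 0], [1, 1]])           # primitive index per pixel
    sparse = np.array([[2.0, 0.0], [0.0, 2.0]])     # sparse depth, 0 = missing
    s = fit_scales_to_sparse_depth(unscaled, segments, sparse, num_segments=2)
    print(compose_depth(unscaled, segments, s))
```

In the multi-view setting described above, the scales would instead be optimised together with the camera poses against a photometric objective; the sparse-depth fit here only illustrates how one scalar per primitive determines a dense depth map.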

Abstract

Joint camera pose and dense geometry estimation from a set of images or a monocular video remains a challenging problem due to its computational complexity and inherent visual ambiguities. Most dense incremental reconstruction systems operate directly on image pixels and solve for their 3D positions using multi-view geometry cues. Such pixel-level approaches suffer from ambiguities or violations of multi-view consistency (e.g. caused by textureless or specular surfaces). We address this issue with a new image representation which we call a SuperPrimitive. SuperPrimitives are obtained by splitting images into semantically correlated local regions and enhancing them with estimated surface normal directions, both of which are predicted by state-of-the-art single image neural networks. This provides a local geometry estimate per SuperPrimitive, while their relative positions are adjusted based on multi-view observations. We demonstrate the versatility of our new representation by addressing three 3D reconstruction tasks: depth completion, few-view structure from motion, and monocular dense visual odometry.

Paper Structure (36 sections, 9 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Multi-View Geometry with SuperPrimitives. SuperPrimitives are extracted from an input frame by dividing it into image segments equipped with estimated surface normal directions (bottom-left). Each SuperPrimitive induces a dense reconstruction within the corresponding image segment up to an a priori unknown scale. Different possible reconstructions are shown in light blue. The scales are then jointly optimised together with a relative camera pose to fit multi-view photometric constraints (visualised in green and red). The resulting dense reconstruction of the reference frame is shown at the top.
  • Figure 2: SuperPrimitives Extraction. (left) Our front-end processor extracts SuperPrimitives from an image by dividing it into a set of image regions, with surface normal directions estimated for each pixel within a segment. (right) Highlighted SuperPrimitives extracted from the image are visualised by showing their estimated normal and colour maps side by side. Note that some of them are scaled up or down for better viewing. While some SuperPrimitives are akin to object-level segmentation, others tend to represent lower-level image segments.
  • Figure 3: Depth Completion on VOID. We visualise the coloured unprojections of ground truth depth maps provided by a sensor (top row) and the geometry estimated by our method (bottom row). Sparse depth input points are visualised as red dots (electronic zoom-in recommended). Qualitatively, we achieve sharper geometry estimates than from a commodity depth sensor.
  • Figure 4: 3-View SfM on ScanNet. We visualise the unprojected reference-frame depth maps predicted by our method for few-view SfM using one reference view and two supplementary views. Note that for this experiment we used a surface normal prediction network pretrained only on HyperSim (Roberts et al., 2021).
  • Figure 5: TUM Reconstruction Results. Examples of reconstructions produced by our monocular VO system on the TUM dataset. Each image shows a coloured point cloud of the geometry estimated on an odometry keyframe.
  • ...and 2 more figures
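
Several captions above refer to unprojecting depth maps into coloured point clouds. A minimal sketch of that step follows, assuming a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and treating zero depth as invalid. The function name and conventions are illustrative and not taken from the paper.

```python
import numpy as np


def unproject_depth(depth, rgb, fx, fy, cx, cy):
    """Back-project a depth map into a coloured 3D point cloud (pinhole model).

    depth: (H, W) array, 0 marks invalid pixels (assumed convention).
    rgb:   (H, W, 3) array of colours aligned with the depth map.
    Returns (N, 3) points in the camera frame and the matching (N, 3) colours.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) / fx * z
    y = (v[valid] - cy) / fy * z
    points = np.stack([x, y, z], axis=-1)
    colours = rgb[valid]
    return points, colours


if __name__ == "__main__":
    depth = np.full((2, 2), 2.0)
    depth[0, 0] = 0.0                      # one invalid pixel
    rgb = np.zeros((2, 2, 3), dtype=np.uint8)
    pts, cols = unproject_depth(depth, rgb, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
    print(pts.shape, cols.shape)
```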