Efficient 3D Instance Mapping and Localization with Neural Fields

George Tang, Krishna Murthy Jatavallabhula, Antonio Torralba

TL;DR

This work introduces 3DIML, a novel framework that efficiently learns a label field that can be rendered from novel viewpoints to produce view-consistent instance segmentation masks. It also introduces InstanceLoc, which enables near-real-time localization of instance masks by fusing the outputs of a trained label field and an off-the-shelf image segmentation model.

Abstract

We tackle the problem of learning an implicit scene representation for 3D instance segmentation from a sequence of posed RGB images. To this end, we introduce 3DIML, a novel framework that efficiently learns a neural label field that can render 3D instance segmentation masks from novel viewpoints. In contrast to prior art that optimizes a neural field in a self-supervised manner, which requires complicated training procedures and loss function design, 3DIML uses a two-phase process. The first phase, InstanceMap, takes as input 2D segmentation masks of the image sequence generated by a frontend instance segmentation model and associates corresponding masks across images with 3D labels. These almost 3D-consistent pseudolabel masks are then used in the second phase, InstanceLift, to supervise the training of a neural label field, which interpolates regions missed by InstanceMap and resolves ambiguities. Additionally, we introduce InstanceLoc, which enables near-real-time localization of instance masks given a trained neural label field. We evaluate 3DIML on sequences from the Replica and ScanNet datasets and demonstrate its effectiveness under mild assumptions on the image sequences. 3DIML achieves a large practical speedup over existing implicit scene representation methods at comparable quality, showcasing its potential to facilitate faster and more effective 3D scene understanding.

Paper Structure (13 sections, 2 equations, 9 figures, 5 tables)


Figures (9)

  • Figure 1: Our approach, 3DIML, learns an implicit representation of a scene as a composition of object instances. It does so by lifting 2D view-inconsistent instance labels from off-the-shelf 2D segmentation models (such as Segment Anything) into 3D view-consistent instance labels. The images above show results for the in-the-wild scan postdoc office generated using 3DIML, composed of InstanceMap (left) and InstanceLift. InstanceLoc (right) is then used to refine the results. Each identified 3D label is shown in a different color. Notice how thin and partially occluded objects are accurately delineated across the sequence.
  • Figure 2: Overview of 3DIML. A sequence of color images is segmented into object instances by an image segmentation backbone. The resulting masks are fed into InstanceMap, which produces instance masks consistent over all frames. These pseudo instance masks and their respective camera poses are used to supervise an instance label NeRF, which further improves consistency and resolves ambiguity present in the InstanceMap outputs. The feature extraction and global data association blocks together form InstanceMap.
  • Figure 3: InstanceLoc enables 3D-consistent instance segmentation for novel views of the scene unobserved by the InstanceMap pipeline. We leverage off-the-shelf instance segmentation models to first produce 3D-inconsistent instance labels for a new input image. We then query the label field over a sparse set of points on the image and use the results to localize each 2D instance mask, i.e., assign a 3D-consistent label to each mask.
  • Figure 4: Comparison between Panoptic Lifting and 3DIML for room0 from Replica-vMap.
  • Figure 5: InstanceLift is able to fill in labels missed by InstanceMap as well as correct ambiguities. Here we show comparisons between them for office0 and room0 from Replica-vMap.
  • ...and 4 more figures
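
The localization step described in the Figure 3 caption (query the label field at a sparse set of image points, then assign one 3D-consistent label per 2D mask) can be illustrated with a minimal sketch. This is not the authors' implementation: the majority-vote rule, the function name `localize_masks`, the `query_label_field` callable, and the `num_samples` parameter are all assumptions made for illustration. The source only states that sparse label-field queries are used to label each mask.

```python
import numpy as np


def localize_masks(masks, query_label_field, num_samples=32, seed=0):
    """Assign one 3D label to each 2D mask by majority vote (assumed rule).

    masks: list of boolean arrays of shape (H, W), one per 2D instance mask
        from an off-the-shelf segmentation model.
    query_label_field: hypothetical callable that maps an (N, 2) integer
        array of (row, col) pixels to an (N,) integer array of 3D labels,
        e.g. by rendering a trained label field at the image's camera pose.
    Returns a list of integer labels; -1 marks an empty mask.
    """
    rng = np.random.default_rng(seed)
    assigned = []
    for mask in masks:
        pixels = np.argwhere(mask)  # (K, 2) pixel coordinates inside the mask
        if len(pixels) == 0:
            assigned.append(-1)
            continue
        # Query only a sparse subset of the mask's pixels.
        k = min(num_samples, len(pixels))
        idx = rng.choice(len(pixels), size=k, replace=False)
        labels = query_label_field(pixels[idx])
        # Majority vote over the sampled labels.
        values, counts = np.unique(labels, return_counts=True)
        assigned.append(int(values[np.argmax(counts)]))
    return assigned


# Toy stand-in for a label field: left half is label 3, right half is label 7.
toy_field = np.zeros((4, 6), dtype=int)
toy_field[:, :3] = 3
toy_field[:, 3:] = 7


def toy_query(pixels):
    return toy_field[pixels[:, 0], pixels[:, 1]]


left = np.zeros((4, 6), dtype=bool)
left[:, :3] = True
right = ~left
empty = np.zeros((4, 6), dtype=bool)

result = localize_masks([left, right, empty], toy_query)
print(result)
```

Sampling a fixed number of pixels per mask keeps the number of label-field queries independent of image resolution, which is one plausible reading of why sparse querying helps with near-real-time operation.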