PanopticRecon: Leverage Open-vocabulary Instance Segmentation for Zero-shot Panoptic Reconstruction

Xuan Yu, Yili Liu, Chenrui Han, Sitong Mao, Shunbo Zhou, Rong Xiong, Yiyi Liao, Yue Wang

TL;DR

PanopticRecon tackles zero-shot panoptic reconstruction from RGB-D images by combining open-vocabulary instance segmentation with neural implicit surface modeling. It addresses partial labeling with DINOv2-driven label propagation and resolves cross-view instance association with a 3D instance graph that yields globally unique instance IDs. The method uses a four-volume neural implicit representation for geometry, color, semantics, and instances, optimized through multi-task losses and volume rendering in a two-stage training scheme. On ScanNet V2 and KITTI-360, PanopticRecon outperforms state-of-the-art methods, demonstrating the effectiveness of the graph-based association and label propagation for open-vocabulary scenes.

Abstract

Panoptic reconstruction is a challenging task in 3D scene understanding. However, most existing methods rely heavily on pre-trained semantic segmentation models and known 3D object bounding boxes for 3D panoptic segmentation, neither of which is available for in-the-wild scenes. In this paper, we propose a novel zero-shot panoptic reconstruction method from RGB-D images of scenes. For zero-shot segmentation, we leverage open-vocabulary instance segmentation, which introduces two challenges: partial labeling and instance association. We tackle both by propagating partial labels with the aid of dense generalized features and by building a 3D instance graph to associate 2D instance IDs. Specifically, we use the partial labels to learn a classifier on generalized semantic features, which provides complete labels for scenes with dense distilled features. Moreover, we formulate instance association as a 3D instance graph segmentation problem, allowing us to fully utilize the scene geometry prior and all 2D instance masks to infer globally unique pseudo 3D instance IDs. Our method outperforms state-of-the-art methods on the indoor dataset ScanNet V2 and the outdoor dataset KITTI-360, demonstrating the effectiveness of our graph segmentation method and reconstruction network.
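
The label-propagation idea in the abstract (learn a classifier on dense features from the partially labeled pixels, then use it to label the rest) can be illustrated with a small sketch. This is not the paper's classifier: a nearest-centroid rule with cosine similarity is swapped in for simplicity, short lists of floats stand in for DINOv2 features, and the names `normalize`, `fit_centroids`, and `propagate_labels` are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def normalize(vec):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fit_centroids(features, labels):
    """Average the normalized features of the labeled pixels, per class.

    labels[i] is None where the open-vocabulary segmenter gave no label.
    """
    sums, counts = {}, defaultdict(int)
    for feat, label in zip(features, labels):
        if label is None:
            continue
        feat = normalize(feat)
        if label not in sums:
            sums[label] = [0.0] * len(feat)
        sums[label] = [s + x for s, x in zip(sums[label], feat)]
        counts[label] += 1
    return {c: normalize([s / counts[c] for s in sums[c]]) for c in sums}


def propagate_labels(features, labels):
    """Keep existing labels; assign each unlabeled pixel the class of its most similar centroid.

    Assumption: a nearest-centroid rule stands in for the learned classifier.
    """
    centroids = fit_centroids(features, labels)
    if not centroids:
        return list(labels)  # nothing to learn from
    completed = []
    for feat, label in zip(features, labels):
        if label is None:
            unit = normalize(feat)
            label = max(
                centroids,
                key=lambda c: sum(a * b for a, b in zip(unit, centroids[c])),
            )
        completed.append(label)
    return completed
```

Calling `propagate_labels(features, labels)` with per-pixel feature vectors and a label list containing `None` for unlabeled pixels returns a fully labeled list; pixels that already had a label are left untouched.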

Paper Structure (15 sections, 15 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Zero-shot panoptic reconstruction by leveraging open-vocabulary instance segmentation faces two challenges: 1) the 2D semantic labels provided by text-prompt-based VLMs are incomplete; 2) without object-level 3D instance pseudo IDs, 2D instance IDs are inconsistent across views. We fill in unlabeled pixels using distilled DINOv2 features and build a graph to infer 3D instance pseudo IDs.
  • Figure 2: PanopticRecon consists of a reconstruction task and a segmentation task. First, the initial reconstruction step performs implicit surface reconstruction from RGB-D observations to provide the scene geometry for the segmentation task. Second, the segmentation task builds a graph from the mesh normals and infers 3D pseudo IDs, which are used to associate the 2D instance IDs given by the Grounded SAM instance masks. In addition, the 3D instance IDs correct some erroneous semantic labels. Finally, the second reconstruction step performs 2D-3D labeling supervised by the consistent semantic and instance labels, yielding the panoptic mesh, point cloud, and novel-view images of the scene.
  • Figure 3: In (a), the points in the upper graph are the graph nodes (superpoints), and the colored region around each node is its superface. For each frame, we select the nodes belonging to an instance mask based on the overlap between the Grounded SAM instance mask and the mask projected from each superface, and we add votes to the edges between the selected nodes. Conversely, we subtract votes from edges between nodes that fall in masks of different instances in that frame. Edges with non-positive votes are then cut, and each remaining group of connected nodes forms an instance, as shown in (b). Once the 3D instance pseudo IDs are obtained, we associate the 2D instance IDs and correct erroneous semantic labels.
  • Figure 4: The segmentation pipelines of previous work are shown in (a) and (b); ours is shown in (c).
  • Figure 5: The network architecture, described in detail in Sec. E.
  • ...and 2 more figures
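
The edge-voting procedure described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-frame assignment of nodes to instance masks (the overlap test between Grounded SAM masks and projected superfaces) is assumed to be already computed, votes are applied only to edges that exist in the graph, each vote has unit weight, and the function name `segment_instance_graph` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def segment_instance_graph(num_nodes, edges, frames):
    """Vote on edges per frame, cut non-positive edges, return a pseudo ID per node.

    edges:  iterable of (u, v) pairs between adjacent superpoints.
    frames: list of frames; each frame is a list of node sets,
            one set per 2D instance mask (assumed precomputed).
    """
    edge_set = {tuple(sorted(e)) for e in edges}
    votes = defaultdict(int)

    for masks in frames:
        # Positive votes: both endpoints lie in the same instance mask.
        for nodes in masks:
            for u, v in combinations(sorted(nodes), 2):
                if (u, v) in edge_set:
                    votes[(u, v)] += 1
        # Negative votes: endpoints lie in masks of different instances.
        for mask_a, mask_b in combinations(masks, 2):
            for u in mask_a:
                for v in mask_b:
                    key = (min(u, v), max(u, v))
                    if key in edge_set:
                        votes[key] -= 1

    # Union-find over the edges that keep a strictly positive vote.
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edge_set:
        if votes[(u, v)] > 0:
            parent[find(u)] = find(v)

    # Relabel component roots as consecutive pseudo instance IDs.
    ids = {}
    return [ids.setdefault(find(n), len(ids)) for n in range(num_nodes)]
```

For example, on a chain of five superpoints observed in two frames that both separate nodes 2 and 3 into different masks, the edge (2, 3) receives only negative votes and is cut, leaving two instances. Edges that are never observed keep a vote of zero and are also cut, which follows the caption's "non-positive" rule.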