GraphSeg: Segmented 3D Representations via Graph Edge Addition and Contraction

Haozhan Tang, Tianyi Zhang, Oliver Kroemer, Matthew Johnson-Roberson, Weiming Zhi

TL;DR

GraphSeg addresses the problem of obtaining consistent object-level 3D segmentations from sparse multi-view RGB images without depth, by formulating segmentation as an edge-addition and contraction problem over dual graphs that fuse 2D mask correspondences with 3D structural cues from foundation models. The method constructs a 2D correspondence graph $G_{2d}$ and a 3D structure graph $G_{3D}$ via pixel-level matches and Chamfer-based similarities of lifted 3D point clouds, then contracts connected vertices to yield coherent 3D objects. Empirical evaluation on GraspNet-1B demonstrates state-of-the-art segmentation quality, high pixel utility, robustness to sparse views, and clear benefits for downstream robotic manipulation. The approach relies on 3D foundation models to recover geometry, enabling dense, open-vocabulary 3D representations that support real-world grasping tasks with limited images.
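
As a rough illustration of the Chamfer-based similarity mentioned above, the sketch below computes a symmetric Chamfer distance between two lifted point clouds and adds an edge when the distance falls under a threshold. The function names, the averaging convention, and the `threshold` value are assumptions made for this example; the summary does not specify the exact similarity used in GraphSeg.

```python
import numpy as np


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # Mean nearest-neighbour distance from a to b, plus from b to a.
    a_to_b = np.sqrt(d2.min(axis=1)).mean()
    b_to_a = np.sqrt(d2.min(axis=0)).mean()
    return float(a_to_b + b_to_a)


def should_add_edge(a: np.ndarray, b: np.ndarray, threshold: float = 0.05) -> bool:
    """Hypothetical edge rule: connect two masks whose lifted clouds nearly coincide."""
    return chamfer_distance(a, b) < threshold


cloud_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
cloud_b = cloud_a + np.array([0.0, 0.0, 0.01])  # same object, slight offset
cloud_c = cloud_a + np.array([5.0, 0.0, 0.0])   # a different, distant object
```

Here `should_add_edge(cloud_a, cloud_b)` would connect the two masks, while `cloud_a` and `cloud_c` stay unconnected. The dense pairwise matrix is fine for small clouds; a k-d tree would be the usual substitute for larger ones.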

Abstract

Robots operating in unstructured environments often require accurate and consistent object-level representations. This typically requires segmenting individual objects from the robot's surroundings. While recent large models such as Segment Anything (SAM) offer strong performance in 2D image segmentation, these advances do not translate directly to performance in the physical 3D world, where they often over-segment objects and fail to produce consistent mask correspondences across views. In this paper, we present GraphSeg, a framework for generating consistent 3D object segmentations from a sparse set of 2D images of the environment without any depth information. GraphSeg adds edges to graphs and constructs dual correspondence graphs: one from 2D pixel-level similarities and one from inferred 3D structure. We formulate segmentation as a problem of edge addition followed by graph contraction, which merges multiple 2D masks into unified object-level segmentations. We can then leverage 3D foundation models to produce segmented 3D representations. GraphSeg achieves robust segmentation with significantly fewer images and greater accuracy than prior methods. We demonstrate state-of-the-art performance on tabletop scenes and show that GraphSeg enables improved performance on downstream robotic manipulation tasks. Code is available at https://github.com/tomtang502/graphseg.git.
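
The contraction step described in the abstract can be pictured as merging every group of masks linked by added edges into one object. The sketch below is a minimal illustration using a union-find structure, not the authors' implementation; the function name `contract` and the integer mask indices are assumptions made for this example.

```python
from collections import defaultdict


def contract(num_masks: int, edges: list[tuple[int, int]]) -> list[list[int]]:
    """Group mask indices into objects by contracting every added edge."""
    parent = list(range(num_masks))

    def find(x: int) -> int:
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # Contract the edge: keep the smaller index as the representative.
            parent[max(ru, rv)] = min(ru, rv)

    groups = defaultdict(list)
    for m in range(num_masks):
        groups[find(m)].append(m)
    return sorted(groups.values())


# Five 2D masks; edges come from the 2D and 3D correspondence graphs.
objects = contract(5, [(0, 1), (1, 2), (3, 4)])
```

With these edges, masks 0, 1, and 2 collapse into one object and masks 3 and 4 into another. Because contraction only depends on connectivity, edges from both graphs can be pooled into a single list before the call.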

Paper Structure

This paper contains 19 sections, 9 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Robots often need to work with object-level representations. In this work, we tackle the problem of building segmented 3D scenes from a set of images, and introduce the GraphSeg framework. GraphSeg solves multi-view 3D segmentation via a novel graph edge addition and contraction procedure. This facilitates downstream robot manipulation.
  • Figure 2: GraphSeg enables consistent 3D segmentation. We can obtain a set of segmented 2D images by leveraging a pre-trained open-vocabulary segmentation model. We then leverage edge addition via correspondence and graph contraction, over both 2D and lifted 3D representations, to obtain segmented 3D representations.
  • Figure 3: At the core of GraphSeg is an edge addition and graph contraction process. The edge addition is achieved by finding correspondences between masks, via pixel-to-pixel features and 3D structural information.
  • Figure 4: We can find correspondence between masks by considering pixel-level correspondence between images. Here we see examples of some correspondences between two images at different views.
  • Figure 5: GraphSeg can take over-segmented 2D images (e.g. the labels have been unnecessarily segmented from the can and bottle) and produce consistent 3D segmentations (shown in the center).
  • ...and 8 more figures
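
The mask correspondence idea in the captions for Figures 3 and 4 can be sketched as a voting scheme: each pixel-level match between two views votes for the pair of masks it lands in, and mask pairs with enough votes receive an edge. This is only an illustration under stated assumptions. The function name, the label-map encoding (`-1` for background), the `(row, col)` pixel convention, and the `min_votes` threshold are all hypothetical, and the pixel matches are assumed to come from some external matcher.

```python
from collections import Counter

import numpy as np


def mask_edges_from_matches(labels_a, labels_b, pix_a, pix_b, min_votes=3):
    """Return (mask_in_a, mask_in_b) pairs supported by at least min_votes matches.

    labels_a, labels_b: (H, W) integer maps of mask ids, -1 for background.
    pix_a, pix_b: (K, 2) arrays of matched (row, col) pixel coordinates.
    """
    # Look up which mask each matched pixel falls in, for both views.
    ids_a = labels_a[pix_a[:, 0], pix_a[:, 1]]
    ids_b = labels_b[pix_b[:, 0], pix_b[:, 1]]
    # Count votes per mask pair, ignoring matches that touch background.
    votes = Counter(
        (int(i), int(j)) for i, j in zip(ids_a, ids_b) if i >= 0 and j >= 0
    )
    return sorted(pair for pair, n in votes.items() if n >= min_votes)


labels_a = np.full((4, 4), -1)
labels_a[:2, :2] = 0
labels_a[2:, 2:] = 1
labels_b = np.full((4, 4), -1)
labels_b[:2, 2:] = 0
labels_b[2:, :2] = 1

pix_a = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [0, 3]])
pix_b = np.array([[0, 2], [0, 3], [1, 2], [3, 0], [0, 0]])
edges = mask_edges_from_matches(labels_a, labels_b, pix_a, pix_b)
```

In this toy input, three matches link mask 0 in the first view to mask 0 in the second, so that pair gets an edge, while the single match between the two masks labelled 1 falls below the threshold. Requiring several votes is one simple way to keep a stray mismatch from merging unrelated masks.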