Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes

Jeongho Noh, Tai Hyoung Rhee, Eunho Lee, Jeongyun Kim, Sunwoo Lee, Ayoung Kim

TL;DR

This work tackles robust 3D instance segmentation for language-grounded grasping in cluttered scenes with sparse views. It introduces Clutt3R-Seg, a zero-shot pipeline that builds a hierarchy-based instance tree from noisy 2D masks and uses cross-view grouping with conditional substitution to produce view-consistent 3D instances enriched with open-vocabulary embeddings for grounding. It adds a consistency-aware update that preserves instance correspondences from a single post-interaction image, enabling efficient multi-stage grasping as objects move. Evaluations on real and synthetic data show substantial gains over state-of-the-art baselines: Clutt3R-Seg reaches an $AP_{25}$ of $61.66$ on the most challenging heavy-clutter sequences and, with only $4$ input views, surpasses MaskClustering with $8$ views by more than 2x. A real-robot experiment validates practical deployment.

Abstract

Reliable 3D instance segmentation is fundamental to language-grounded robotic manipulation. Its critical application lies in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation for language-grounded grasping in cluttered scenes. Our key idea is to introduce a hierarchical instance tree of semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method leverages them as informative cues: through cross-view grouping and conditional substitution, the tree suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. Each instance is enriched with open-vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes during multi-stage tasks, we further introduce a consistency-aware update that preserves instance correspondences from only a single post-interaction image, allowing efficient adaptation without rescanning. Clutt3R-Seg is evaluated on both synthetic and real-world datasets, and validated on a real robot. Across all settings, it consistently outperforms state-of-the-art baselines in cluttered and sparse-view scenarios. Even on the most challenging heavy-clutter sequences, Clutt3R-Seg achieves an AP@25 of 61.66, over 2.2x higher than baselines, and with only four input views it surpasses MaskClustering with eight views by more than 2x. The code is available at: https://github.com/jeonghonoh/clutt3r-seg.
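The language-grounded target selection described above can be illustrated with a minimal sketch. It assumes each 3D instance already carries an embedding vector and that the instruction has been embedded into the same space; the abstract does not specify the embedding model or the matching rule, so the cosine-similarity ranking, the function names `cosine` and `select_target`, and the toy 3-dimensional vectors are all illustrative assumptions rather than the authors' implementation.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_target(text_embedding: Sequence[float],
                  instance_embeddings: Dict[str, Sequence[float]]) -> str:
    """Return the id of the instance whose embedding best matches the text.

    Hypothetical helper: ranking by cosine similarity is an assumption,
    not a rule stated in the source.
    """
    return max(
        instance_embeddings,
        key=lambda name: cosine(text_embedding, instance_embeddings[name]),
    )


# Toy embeddings standing in for real open-vocabulary features.
instances = {
    "instance_0": [1.0, 0.0, 0.0],
    "instance_1": [0.0, 1.0, 0.0],
    "instance_2": [0.0, 0.0, 1.0],
}
query = [0.9, 0.1, 0.0]  # stand-in for an embedded instruction
target = select_target(query, instances)
```

In a real system the vectors would come from a vision-language model and would have hundreds of dimensions; only the ranking step is shown here.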

Paper Structure

This paper contains 14 sections, 6 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: Clutt3R-Seg resolves over- and under-segmentation and cross-view inconsistencies caused by noisy masks in sparse-view cluttered scenes through a hierarchy-based instance mask grouping algorithm, yielding view-consistent masks and robust 3D instance segmentation. These consistent 3D instances enable language-grounded target identification and consistency-aware multi-stage grasping by detecting displaced objects and optimizing their poses after interaction.
  • Figure 2: Overview of the Clutt3R-Seg pipeline. (a) From posed sparse-view RGB inputs, we estimate depth with MVSAnywhere [izquierdo2025mvsanywhere] and obtain noisy instance masks from Grounded SAM [ren2024grounded] with the prompt "object". (b) A hierarchy-based grouping yields view-consistent instance mask groups robust to over-/under-segmentation, enabling reliable 3D instance segmentation in cluttered scenes. Enriched semantic embeddings allow text-based target identification. (c) A single post-interaction image is associated with prior instances to preserve segmentation consistency. (d) The system detects the new target and displaced objects, optimizing their rigid transformations via a differentiable loss for reliable multi-stage grasping.
  • Figure 3: Hierarchical structure of masks. Grounded SAM outputs masks for object clusters, objects, and sub-objects, corresponding to under-segmentation, proper segmentation, and over-segmentation, respectively. Such noisy masks in cluttered scenes are organized into hierarchical trees, assigning a child node to a parent node if the child's mask is contained within the parent's mask at the pixel level. In this hierarchy, at most one proper segment exists per root-to-leaf path, enabling bottom-up grouping of leaf nodes across frames for instance consistency.
  • Figure 4: Outline of mask grouping and substitution. Initially inconsistent instance masks are organized into their corresponding instance groups via hierarchy-based instance mask grouping (Algorithm 1), and further refined by re-grouping via residual-node parent substitution, resulting in properly segmented instances. The red box illustrates an over-segmentation before substitution, while the green box shows the corrected segmentation after substituting the residual node with its parent.
  • Figure 5: Result of multi-stage target identification and consistency-aware update. Robust 3D instances with enriched semantics preserve consistency under dynamic scene changes, enabling accurate target identification and scene updates in clutter.
  • ...and 4 more figures
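
The containment rule in the Figure 3 caption (a mask becomes a child of a mask that contains it at the pixel level) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: masks are represented as sets of `(row, col)` pixels, the parent is taken to be the smallest strictly larger mask covering at least a fraction `tau` of the child, and `tau`, `containment_ratio`, `build_mask_tree`, `leaves`, and `rect` are hypothetical names introduced here.

```python
from typing import Dict, List, Optional, Set, Tuple

Pixel = Tuple[int, int]


def containment_ratio(child: Set[Pixel], parent: Set[Pixel]) -> float:
    """Fraction of the child's pixels that also lie inside the parent."""
    if not child:
        return 0.0
    return len(child & parent) / len(child)


def build_mask_tree(masks: List[Set[Pixel]],
                    tau: float = 0.9) -> Dict[int, Optional[int]]:
    """Map each mask index to its parent index (None for roots).

    Assumption: the parent is the smallest strictly larger mask that
    contains at least `tau` of the child's pixels.
    """
    parent: Dict[int, Optional[int]] = {}
    for i, mask in enumerate(masks):
        candidates = [
            j for j, other in enumerate(masks)
            if j != i
            and len(other) > len(mask)
            and containment_ratio(mask, other) >= tau
        ]
        parent[i] = (
            min(candidates, key=lambda j: len(masks[j])) if candidates else None
        )
    return parent


def leaves(parent: Dict[int, Optional[int]]) -> List[int]:
    """Indices of masks that have no children, i.e. the finest segments."""
    has_child = {p for p in parent.values() if p is not None}
    return sorted(i for i in parent if i not in has_child)


def rect(r0: int, r1: int, c0: int, c1: int) -> Set[Pixel]:
    """Axis-aligned rectangular toy mask covering rows [r0, r1) and cols [c0, c1)."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


masks = [
    rect(0, 4, 0, 8),  # 0: object cluster (under-segmentation)
    rect(0, 4, 0, 4),  # 1: one object
    rect(0, 4, 4, 8),  # 2: a second object
    rect(0, 2, 0, 4),  # 3: part of object 1 (over-segmentation)
]
tree = build_mask_tree(masks)
leaf_ids = leaves(tree)
```

In this toy scene the cluster mask is the root, the two object masks are its children, and the sub-object mask hangs under the first object, so each root-to-leaf path passes through exactly one properly segmented object. The cross-view grouping and residual-node parent substitution from Figure 4 are not covered by this sketch.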