
CoSMo3D: Open-World Promptable 3D Semantic Part Segmentation through LLM-Guided Canonical Spatial Modeling

Li Jin, Weikai Chen, Yujie Wang, Yingda Yin, Zeyu Hu, Runze Zhang, Keyang Luo, Shengju Qian, Xin Wang, Xueying Qin

TL;DR

This work proposes CoSMo3D, which attains canonical space perception by inducing a latent canonical reference frame learned directly from data, and establishes a new state of the art in open-world promptable 3D segmentation.

Abstract

Open-world promptable 3D semantic segmentation remains brittle because semantics are inferred in the input sensor coordinates. Humans, in contrast, interpret parts via functional roles in a canonical space -- wings extend laterally, handles protrude to the side, and legs support from below. Psychophysical evidence shows that we mentally rotate objects into canonical frames to reveal these roles. To fill this gap, we propose CoSMo3D, which attains canonical space perception by inducing a latent canonical reference frame learned directly from data. By construction, we create a unified canonical dataset through LLM-guided intra- and cross-category alignment, exposing canonical spatial regularities across 200 categories. By induction, we realize canonicality inside the model through a dual-branch architecture with canonical map anchoring and canonical box calibration, collapsing pose variation and symmetry into a stable canonical embedding. This shift from input pose space to canonical embedding yields far more stable and transferable part semantics. Experimental results show that CoSMo3D establishes a new state of the art in open-world promptable 3D segmentation.
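The brittleness of sensor-frame semantics can be illustrated with a toy example. The sketch below is not the paper's method: the eight-point "table", the `legs_below_rule` heuristic, and the known rotation `R` are all hypothetical and chosen only to show that a spatial rule such as "legs support from below" holds in a canonical frame and fails once the object is observed in an arbitrary pose.

```python
import numpy as np

def rotation_about_x(angle_rad):
    """Rotation matrix about the x-axis (right-handed convention)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, c, -s],
                     [0.0, s, c]])

# Hypothetical toy "table" in a canonical frame with +z pointing up:
# four leg points at z = -1 and four tabletop points at z = +1.
legs = np.array([[x, y, -1.0] for x in (-1, 1) for y in (-1, 1)])
top = np.array([[x, y, 1.0] for x in (-1, 1) for y in (-1, 1)])
canonical = np.vstack([legs, top])
is_leg = np.array([True] * 4 + [False] * 4)

def legs_below_rule(points):
    """Toy spatial prior: any point below the z = 0 plane is a 'leg'."""
    return points[:, 2] < 0.0

# Observe the object upside down (180-degree rotation about x).
R = rotation_about_x(np.pi)
observed = canonical @ R.T  # rows are points, so p_obs = R p becomes P @ R.T

acc_canonical = (legs_below_rule(canonical) == is_leg).mean()
acc_observed = (legs_below_rule(observed) == is_leg).mean()
# Undoing the pose (here with the known R) restores the canonical frame.
acc_recovered = (legs_below_rule(observed @ R) == is_leg).mean()

print(acc_canonical, acc_observed, acc_recovered)
```

In this toy setting the rotation is known and can simply be inverted; the abstract instead describes a canonical reference frame that is learned from data and held inside the model.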

Paper Structure (14 sections, 5 equations, 6 figures, 2 tables)


Figures (6)

  • Figure 1: We propose CoSMo3D, an open-world promptable 3D semantic segmentation method. It introduces canonical space perception to overcome the limitations imposed by arbitrary poses and shapes, achieves state-of-the-art performance across multiple settings, and significantly outperforms geometry-mapping-only methods.
  • Figure 2: We propose a dual-branch framework for open-world promptable 3D semantic segmentation: the feature extraction branch encodes 3D shape features (via Point Transformer) and text semantic features (via SigLIP) to enable cross-modal part segmentation. A training-only canonical embedding branch then enforces consistent canonical space perception via semantic contrastive alignment, canonical map anchoring, and canonical box calibration losses, ensuring robust reasoning across any shape in any pose.
  • Figure 3: (a) Prior works perform category-level canonicalization, aligning intra-category shapes but neglecting cross-category consistency. (b) We cluster categories via an LLM and align different categories based on key semantic parts and functional consistency.
  • Figure 4: Handling Symmetric Objects for Canonical Map Anchoring. (a) For symmetric shapes, multiple valid poses induce ambiguous point-wise canonical labels, making direct point-wise supervision unreliable. (b) Prior works rely on category-specific processing with manual symmetry annotation, limiting open-world scalability. (c) We instead apply an order-invariant set loss on RGB-encoded canonical coordinates that matches the overall layout of semantic parts in canonical space while remaining robust to symmetric pose ambiguities.
  • Figure 5: Qualitative comparison of promptable 3D part segmentation. Across challenging cases (similar geometry with different semantics, noise-prone objects, cross-category semantics, and arbitrary poses), our method produces more accurate and consistent part localizations than existing baselines.
  • ...and 1 more figures
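
The symmetry issue described in the Figure 4 caption can be sketched numerically. The captions here do not give the exact form of the order-invariant set loss, so the example below uses a symmetric Chamfer distance as an assumed stand-in. The function names `chamfer_set_loss` and `pointwise_loss`, and the four-point square, are hypothetical, and raw coordinates are used in place of the RGB encoding.

```python
import numpy as np

def chamfer_set_loss(pred, target):
    """Symmetric Chamfer distance between two point sets.

    Assumed stand-in for an order-invariant set loss: it compares the two
    sets as a whole and ignores which index each point occupies.
    """
    diff = pred[:, None, :] - target[None, :, :]  # (N, M, 3)
    d2 = np.sum(diff ** 2, axis=-1)               # squared distances, (N, M)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def pointwise_loss(pred, target):
    """Index-aligned mean squared error: point i is compared to target i."""
    return np.mean(np.sum((pred - target) ** 2, axis=-1))

# Canonical coordinates of a shape with four-fold symmetry (a square).
target = np.array([[1, 1, 0], [-1, 1, 0], [-1, -1, 0], [1, -1, 0]], dtype=float)

# An equally valid pose of the same shape: every point takes the canonical
# coordinate of its neighbour, so the set of coordinates is unchanged.
pred = np.roll(target, 1, axis=0)

# A genuinely wrong layout: the whole shape is shifted along x.
shifted = target + np.array([0.5, 0.0, 0.0])

print(pointwise_loss(pred, target))       # penalises the valid symmetric pose
print(chamfer_set_loss(pred, target))     # does not penalise it
print(chamfer_set_loss(shifted, target))  # still penalises a wrong layout
```

Under these assumptions the point-wise loss assigns a large error to a prediction that is correct up to symmetry, while the set loss is zero for it and remains positive for a layout that is actually wrong.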