CAD-Prompted SAM3: Geometry-Conditioned Instance Segmentation for Industrial Objects

Zhenran Tang, Rohan Nagabhirava, Changliu Liu

TL;DR

This work proposes a CAD-prompted segmentation framework built on SAM3 that takes canonical multi-view renderings of a CAD model as prompt input. It enables single-stage, CAD-prompted mask prediction, extending promptable segmentation to objects that cannot be robustly described by language or appearance alone.

Abstract

Verbal-prompted segmentation is inherently limited by the expressiveness of natural language and struggles with uncommon, instance-specific, or difficult-to-describe objects: scenarios frequently encountered in manufacturing and 3D printing environments. While image exemplars provide an alternative, they primarily encode appearance cues such as color and texture, which are often unrelated to a part's geometric identity. In industrial settings, a single component may be produced in different materials, finishes, or colors, making appearance-based prompting unreliable. In contrast, such objects are typically defined by precise CAD models that capture their canonical geometry. We propose a CAD-prompted segmentation framework built on SAM3 that uses canonical multi-view renderings of a CAD model as prompt input. The rendered views provide geometry-based conditioning independent of surface appearance. The model is trained using synthetic data generated from mesh renderings in simulation under diverse viewpoints and scene contexts. Our approach enables single-stage, CAD-prompted mask prediction, extending promptable segmentation to objects that cannot be robustly described by language or appearance alone.
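
To make "canonical multi-view renderings" concrete, the sketch below computes a fixed set of camera poses on a sphere around a CAD model centered at the origin. The abstract does not state how many views are used or where the cameras sit, so the view count, elevations, radius, and the helper names `canonical_camera_positions` and `look_at_rotation` are all assumptions made for illustration, not the authors' setup. A renderer of your choice would consume these poses to produce the prompt images.

```python
import numpy as np

def canonical_camera_positions(n_azimuth=4, elevations_deg=(0.0, 45.0), radius=2.0):
    """Return (n_views, 3) camera centers on a sphere around the origin.

    The 4 x 2 layout is an illustrative assumption, not taken from the paper.
    """
    positions = []
    for elev in np.deg2rad(elevations_deg):
        for k in range(n_azimuth):
            azim = 2.0 * np.pi * k / n_azimuth
            positions.append([
                radius * np.cos(elev) * np.cos(azim),
                radius * np.cos(elev) * np.sin(azim),
                radius * np.sin(elev),
            ])
    return np.asarray(positions)

def look_at_rotation(camera_pos, target=(0.0, 0.0, 0.0), up=(0.0, 0.0, 1.0)):
    """World-to-camera rotation; rows are the camera's right, up, and backward axes.

    Assumes the viewing direction is not parallel to `up` (true for the
    elevations used above).
    """
    camera_pos = np.asarray(camera_pos, dtype=float)
    forward = np.asarray(target, dtype=float) - camera_pos
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up, dtype=float))
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    # The camera looks down its negative z-axis (OpenGL-style convention).
    return np.stack([right, true_up, -forward])

cams = canonical_camera_positions()
rotations = np.stack([look_at_rotation(c) for c in cams])
print(cams.shape, rotations.shape)
```

Because the poses depend only on the object frame and not on any scene image, the same set of views can be rendered once per CAD model and reused for every query image.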

Paper Structure (29 sections, 13 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Qualitative comparison of CAD-prompted SAM3 with SAM3 (exemplar and text prompts), Matcher, and PerSAM. Our approach produces more accurate instance masks given identical CAD-based prompts. Text prompts for SAM3 are generated from the CAD renderings using GPT-5.1.
  • Figure 2: Model architecture. Canonical multi-view renderings of a CAD mesh are encoded into geometry-aware embeddings, which are injected as cross-image prompt tokens into a fusion transformer. The fused features are processed by the SAM3 detection and mask heads to produce instance masks in one forward pass.
  • Figure 3: Geometry-conditioned training pipeline. (a) Canonical CAD renderings serve as prompts. (b) Synthetic training data with instance annotations.
  • Figure 4: Representative samples from the custom 3D printing dataset.
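
The prompt-injection step described in the Figure 2 caption can be illustrated with a generic cross-attention layer, in which scene-image tokens attend to tokens derived from the CAD renderings. The caption does not specify the internals of the fusion transformer, so what follows is a minimal single-head sketch with a residual connection under assumed shapes (8 views, 4 tokens per view, 16-dimensional features) and random weights. It is not the SAM3 implementation, and `cross_attend` is a hypothetical helper name.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(scene_tokens, prompt_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: scene tokens query the CAD prompt tokens."""
    q = scene_tokens @ w_q                     # (n_scene, d)
    k = prompt_tokens @ w_k                    # (n_prompt, d)
    v = prompt_tokens @ w_v                    # (n_prompt, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    fused = scene_tokens + attn @ v            # residual connection
    return fused, attn

rng = np.random.default_rng(0)
d = 16                                         # assumed feature width
scene_tokens = rng.normal(size=(64, d))        # e.g. an 8x8 grid of scene features
prompt_tokens = rng.normal(size=(8 * 4, d))    # assumed: 8 views x 4 tokens per view
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, attn = cross_attend(scene_tokens, prompt_tokens, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In this sketch the fused tokens keep the shape of the scene tokens, so downstream detection and mask heads would not need to know how many prompt views were supplied.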