ISCUTE: Instance Segmentation of Cables Using Text Embedding

Shir Kozlovsky, Omkar Joglekar, Dotan Di Castro

TL;DR

This paper tackles the identification and segmentation of Deformable Linear Objects (DLOs) such as cables, for which conventional perceptual cues (shape, color, texture) are weak. It introduces ISCUTE, an adapter that bridges CLIPSeg's text-conditioned segmentation with SAM's prompt-driven segmentation, enabling text-prompted, one-shot DLO instance segmentation while keeping both foundation models frozen. A diverse, CAD-generated DLO dataset (~30k images) supports training and evaluation, and the approach reaches a state-of-the-art mIoU of $91.21\%$ with strong zero-shot generalization to external DLO datasets. The result is a user-friendly, text-driven solution for DLO perception. The authors note that performance is upper-bounded by the underlying foundation models, and they outline future improvements to the classifier component and to broader applicability.

Abstract

In the field of robotics and automation, conventional object recognition and instance segmentation methods face a formidable challenge when it comes to perceiving Deformable Linear Objects (DLOs) like wires, cables, and flexible tubes. This challenge arises primarily from the lack of distinct attributes such as shape, color, and texture, which calls for tailored solutions to achieve precise identification. In this work, we propose a foundation model-based DLO instance segmentation technique that is text-promptable and user-friendly. Specifically, our approach combines the text-conditioned semantic segmentation capabilities of the CLIPSeg model with the zero-shot generalization capabilities of the Segment Anything Model (SAM). We show that our method exceeds SOTA performance on DLO instance segmentation, achieving an mIoU of $91.21\%$. We also introduce a rich and diverse DLO-specific dataset for instance segmentation.
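
The headline metric above, mIoU, is the mean intersection-over-union between predicted and ground-truth masks. The following is a minimal sketch of that computation, not the authors' evaluation code. It assumes binary masks given as nested lists, that each prediction is already matched to its ground-truth instance, and that two empty masks count as a perfect match. The names `iou` and `mean_iou` are hypothetical, and the excerpt does not state the paper's exact averaging protocol.

```python
from typing import Sequence

# A 2D binary mask: 1 (truthy) marks a pixel that belongs to the instance.
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average IoU over already-matched (prediction, ground truth) pairs."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        raise ValueError("at least one mask pair is required")
    return sum(iou(p, t) for p, t in zip(preds, targets)) / len(preds)


if __name__ == "__main__":
    cable_a = [[1, 1, 0], [0, 0, 0]]
    cable_b = [[1, 0, 0], [1, 0, 0]]
    # One perfect match and one partial overlap (1 shared pixel out of 3).
    print(mean_iou([cable_a, cable_a], [cable_a, cable_b]))
```

Under this definition, a reported mIoU of $91.21\%$ corresponds to `mean_iou` returning about `0.9121` over the evaluation set.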

Paper Structure (22 sections, 14 figures, 5 tables)

Figures (14)

  • Figure 1: Overview of the full pipeline - blocks in red represent our additions
  • Figure 2: Issues with using SAM out-of-the-box
  • Figure 3: The ISCUTE adapter: on the left, the architecture of the prompt encoder network is outlined (indicated by the purple dashed line), while on the right, the classifier network architecture is detailed (represented by the pink dashed line).
  • Figure 4: Qualitative comparison in specific scenarios. Each scenario demonstrates the following: (a) and (b) real images, (c) identical colors, (d) a high density of cables in a single image, and (e) and (f) small DLOs at the edge of the image with varying thicknesses.
  • Figure 5: A qualitative comparison of our model vs. the SOTA baselines
  • ...and 9 more figures