Guided SAM: Label-Efficient Part Segmentation
S. B. van Rooij, G. J. Burghouts
TL;DR
Guided SAM addresses fine-grained object part segmentation with minimal supervision by learning positional prompts from coarse patch annotations and aggregating patches into ROIs that steer the Segment-Anything Model (SAM). A patch-based pipeline (prototypical patches, patch selection, patch annotation, and a guidance classifier) produces ROI-centered prompts that condition SAM for targeted part segmentation. On the CarParts dataset, Guided SAM outperforms prior methods, increasing average IoU from $0.37$ to $0.49$ while requiring patch annotations that are roughly five times cheaper to obtain than full part masks. The approach demonstrates a practical, data-efficient pathway to extend foundation models to part-level understanding, with broad potential for robotics and recognition tasks.
Abstract
Localizing object parts precisely is essential for tasks such as object recognition and robotic manipulation. Recent part segmentation methods require extensive training data and labor-intensive annotations. The Segment-Anything Model (SAM) has demonstrated good performance on a wide range of segmentation problems, but requires (manual) positional prompts that tell it where to segment. Furthermore, since it has been trained on full objects instead of object parts, it is prone to over-segmentation of parts. To address this, we propose a novel approach that guides SAM towards the relevant object parts. Our method learns positional prompts from coarse patch annotations that are easier and cheaper to acquire. We train classifiers on image patches to identify part classes and aggregate patches into regions of interest (ROIs) with positional prompts. SAM is conditioned on these ROIs and prompts. This approach, termed 'Guided SAM', enhances efficiency and reduces manual effort, allowing effective part segmentation with minimal labeled data. We demonstrate the efficacy of Guided SAM on a dataset of car parts, improving the average IoU over state-of-the-art models from 0.37 to 0.49 with annotations that are on average five times more efficient to acquire.
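To make the step from patch classification to SAM prompts concrete, the following is a minimal sketch of how per-patch scores for one part class could be aggregated into an ROI and a point prompt. It is not the authors' implementation: the function name `patches_to_roi_and_prompt`, the fixed score threshold, the bounding-box aggregation rule, and the choice of the highest-scoring patch center as the point prompt are all assumptions made for illustration.

```python
import numpy as np

def patches_to_roi_and_prompt(scores, patch_size, threshold=0.5):
    """Aggregate a grid of per-patch part scores into an ROI and a point prompt.

    scores:     2D array (rows, cols), one classifier score per image patch.
    patch_size: side length of a square patch in pixels (assumed uniform grid).
    threshold:  hypothetical cut-off above which a patch counts as 'part'.
    Returns ((x0, y0, x1, y1), (px, py)) in pixel coordinates, or None.
    """
    scores = np.asarray(scores, dtype=float)
    selected = scores >= threshold
    if not selected.any():
        return None  # no patch was classified as this part

    rows, cols = np.nonzero(selected)
    # ROI: tightest box around all selected patches (assumed aggregation rule).
    x0 = int(cols.min() * patch_size)
    y0 = int(rows.min() * patch_size)
    x1 = int((cols.max() + 1) * patch_size)
    y1 = int((rows.max() + 1) * patch_size)

    # Point prompt: center of the highest-scoring patch (assumed choice).
    r, c = np.unravel_index(np.argmax(scores), scores.shape)
    point = ((c + 0.5) * patch_size, (r + 0.5) * patch_size)
    return (x0, y0, x1, y1), point


# Toy 4x4 patch grid with three confident patches, 32-pixel patches.
grid = np.zeros((4, 4))
grid[1, 1], grid[1, 2], grid[2, 1] = 0.9, 0.6, 0.7
roi, point = patches_to_roi_and_prompt(grid, patch_size=32)
```

In the public `segment_anything` package, a box in this (x0, y0, x1, y1) form and a point like this one would typically be passed to `SamPredictor.predict` through its `box`, `point_coords`, and `point_labels` arguments; whether Guided SAM conditions SAM in exactly this way is not stated in the abstract.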
