Few-shot Structure-Informed Machinery Part Segmentation with Foundation Models and Graph Neural Networks
Michael Schwingshackl, Fabio Francisco Oberweger, Markus Murschitz
TL;DR
This work tackles few-shot semantic segmentation of machinery parts by orchestrating foundation models (CLIPSeg and SAM) with SuperPoint features and a trainable graph neural network. Images are represented as graphs of enhanced interest points, and SAM prompts are derived from the resulting structured graph. With this design, the approach segments parts accurately from as few as one annotated sample, with training times under five minutes on consumer GPUs. The synthetic data pipeline enables rapid development, and synthetic-to-real evaluation demonstrates strong generalization to real crane imagery. The method also reaches competitive semi-supervised performance on DAVIS 2017 without temporal modeling. Overall, the results highlight the potential of integrating foundation models with graph-based reasoning for fast, adaptable segmentation in dynamic machinery and infrastructure contexts.
Abstract
This paper proposes a novel approach to few-shot semantic segmentation for machinery with multiple parts that exhibit spatial and hierarchical relationships. Our method integrates the foundation models CLIPSeg and Segment Anything Model (SAM) with the interest point detector SuperPoint and a graph convolutional network (GCN) to accurately segment machinery parts. Given 1 to 25 annotated samples, our model, evaluated on a purely synthetic dataset depicting a truck-mounted loading crane, achieves effective segmentation across various levels of detail. Training times are kept under five minutes on consumer GPUs. The model generalizes robustly from synthetic to real data, reaching a $J\&F$ score (the mean of region similarity $J$ and contour accuracy $F$) of 92.2 on real data using 10 synthetic support samples. When benchmarked on the DAVIS 2017 dataset, it achieves a $J\&F$ score of 71.5 in semi-supervised video segmentation with three support samples. The method's fast training times and effective generalization to real data make it a valuable tool for autonomous systems interacting with machinery and infrastructure, and illustrate the potential of combined and orchestrated foundation models for few-shot segmentation tasks.
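To make the graph-based part of the pipeline more concrete, the following is a minimal sketch of how interest points could be turned into a graph and passed through one graph-convolution step. It is not the authors' implementation. The k-nearest-neighbour connectivity, the symmetric normalization in the style of a standard GCN layer, the random stand-in keypoints and descriptors, the four part classes, and the names `knn_adjacency` and `gcn_layer` are all assumptions made for illustration.

```python
import numpy as np

def knn_adjacency(points, k):
    """Symmetric k-nearest-neighbour adjacency over 2D interest points (assumed connectivity)."""
    n = len(points)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is never its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    adj = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    adj[rows, neighbours.ravel()] = 1.0
    return np.maximum(adj, adj.T)  # make the graph undirected

def gcn_layer(adj, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(len(adj))  # add self-loops
    deg_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weights, 0.0)

rng = np.random.default_rng(0)
points = rng.uniform(0, 256, size=(50, 2))   # stand-in for SuperPoint keypoint locations
features = rng.normal(size=(50, 16))         # stand-in for enhanced per-point descriptors
weights = rng.normal(size=(16, 4)) * 0.1     # untrained weights, 4 hypothetical part classes

adj = knn_adjacency(points, k=5)
scores = gcn_layer(adj, features, weights)
labels = scores.argmax(axis=1)               # per-point part label from untrained weights
```

In a trained system of this kind, the per-point labels would be what turns graph nodes into point prompts for a segmenter such as SAM. Here the weights are random, so the labels carry no meaning beyond showing the data flow.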
