Few-shot Structure-Informed Machinery Part Segmentation with Foundation Models and Graph Neural Networks

Michael Schwingshackl, Fabio Francisco Oberweger, Markus Murschitz

TL;DR

This work addresses few-shot semantic segmentation of machinery parts by orchestrating the foundation models CLIPSeg and SAM with SuperPoint features and a trainable graph neural network. Images are represented as graphs of enhanced interest points, and SAM prompts are derived from the structured graph, so the approach achieves accurate part segmentation with as few as one annotated sample and training times under five minutes on consumer GPUs. The synthetic data pipeline enables rapid development, and synthetic-to-real evaluation demonstrates strong generalization to real crane imagery, complemented by competitive semi-supervised performance on DAVIS 2017 without temporal modeling. Overall, the method highlights the potential of integrating foundation models with graph-based reasoning for fast, adaptable segmentation of dynamic machinery and infrastructure.
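The graph-based step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the k-nearest-neighbour connection rule, the feature sizes, the number of part classes, and the function names `knn_adjacency` and `gcn_layer` are all assumptions made for illustration, and random arrays stand in for SuperPoint locations and descriptors. It only shows how interest points can become graph nodes that a graph convolution classifies into parts.

```python
import numpy as np

def knn_adjacency(points, k=4):
    """Symmetric k-nearest-neighbour adjacency over 2D keypoints.

    The kNN rule is an assumed way to connect interest points;
    the summary above does not specify how edges are built.
    """
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)            # exclude self-matches
    nbrs = np.argsort(dist, axis=1)[:, :k]    # k closest neighbours per node
    n = len(points)
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(adj, adj.T)             # make the graph undirected

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(len(adj))            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ feats @ weight, 0.0)

rng = np.random.default_rng(0)
points = rng.uniform(0, 256, size=(12, 2))    # stand-in for interest point locations
feats = rng.normal(size=(12, 8))              # stand-in for per-point descriptors
w1 = rng.normal(size=(8, 16))                 # hidden-layer weights (untrained)
w2 = rng.normal(size=(16, 5))                 # readout for 5 hypothetical part classes

adj = knn_adjacency(points, k=4)
hidden = gcn_layer(adj, feats, w1)
logits = hidden @ w2
labels = logits.argmax(axis=1)                # per-node part prediction
```

In a pipeline of the kind described, the labelled nodes would then be grouped by predicted part and their image coordinates passed to SAM as point prompts; that step is omitted here because it depends on the SAM interface.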

Abstract

This paper proposes a novel approach to few-shot semantic segmentation for machinery with multiple parts that exhibit spatial and hierarchical relationships. Our method integrates the foundation models CLIPSeg and Segment Anything Model (SAM) with the interest point detector SuperPoint and a graph convolutional network (GCN) to accurately segment machinery parts. By providing 1 to 25 annotated samples, our model, evaluated on a purely synthetic dataset depicting a truck-mounted loading crane, achieves effective segmentation across various levels of detail. Training times are kept under five minutes on consumer GPUs. The model demonstrates robust generalization to real data, achieving a qualitative synthetic-to-real generalization with a $J\&F$ score of 92.2 on real data using 10 synthetic support samples. When benchmarked on the DAVIS 2017 dataset, it achieves a $J\&F$ score of 71.5 in semi-supervised video segmentation with three support samples. This method's fast training times and effective generalization to real data make it a valuable tool for autonomous systems interacting with machinery and infrastructure, and illustrate the potential of combined and orchestrated foundation models for few-shot segmentation tasks.
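For reference, the $J\&F$ score used above follows the DAVIS benchmark convention, which averages a region measure and a contour measure. The symbols $M$ (predicted mask), $G$ (ground-truth mask), $P_c$ and $R_c$ (contour precision and recall) are notation assumed here and do not appear in the abstract:

$$
J = \frac{|M \cap G|}{|M \cup G|}, \qquad
F = \frac{2\,P_c\,R_c}{P_c + R_c}, \qquad
J\&F = \frac{J + F}{2}
$$

Scores such as 92.2 and 71.5 are these values expressed as percentages, averaged over the evaluated objects and frames.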

Paper Structure (27 sections, 20 figures, 17 tables)

Figures (20)

  • Figure 1: Intermediate steps of our pipeline.
  • Figure 2: System architecture, with all frozen foundation models (yellow) and the novel modules (green). Only the GNN is trained.
  • Figure 3: Top: The three axes of domain randomization: environment HDRi maps, changing camera perspective, and different crane arm articulations. Bottom: Samples of the dataset.
  • Figure 4: Five different annotation granularity levels, ranging from background and truck differentiation up to 22 individual parts.
  • Figure 5: Network architecture of the trainable GCN-based graph classifier.
  • ...and 15 more figures