Supervised Representation Learning towards Generalizable Assembly State Recognition

Tim J. Schoonbeek, Goutham Balachandran, Hans Onvlee, Tim Houben, Shao-Hsuan Hung, Jacek Kustra, Peter H. N. de With, Fons van der Sommen

TL;DR

This work reframes assembly state recognition as a representation-learning problem to address scalability and robustness to execution errors. It introduces the intermediate-state informed loss function modification (ISIL), which uses unlabeled intermediate configurations exclusively as negative samples. The authors present a supervised contrastive framework trained on real, synthetic, and unlabeled data, and demonstrate that ISIL improves clustering and classification across backbones (e.g., ResNet-34, ViT-S) and losses, while enabling generalization to unseen part configurations and unseen error states. Evaluations on the IndustReal dataset show substantial gains in $F_1@1$ and $MAP@R(+)$ over classification baselines, and the paper provides an extensive analysis of error-state generalization with new annotations. The approach targets real-time, scalable assembly-state monitoring for industrial settings and sets the stage for further work on real-world data, edge deployment, and weak supervision signals.

Abstract

Assembly state recognition facilitates the execution of assembly procedures, offering feedback to enhance efficiency and minimize errors. However, recognizing assembly states poses challenges in scalability, since parts are frequently updated, and the robustness to execution errors remains underexplored. To address these challenges, this paper proposes an approach based on representation learning and the novel intermediate-state informed loss function modification (ISIL). ISIL leverages unlabeled transitions between states and demonstrates significant improvements in clustering and classification performance for all tested architectures and losses. Despite being trained exclusively on images without execution errors, thorough analysis on error states demonstrates that our approach accurately distinguishes between correct states and states with various types of execution errors. The integration of the proposed algorithm can offer meaningful assistance to workers and mitigate unexpected losses due to procedural mishaps in industrial settings. The code is available at: https://timschoonbeek.github.io/state_rec

Paper Structure (14 sections, 2 equations, 8 figures)

Figures (8)

  • Figure 1: This work defines assembly state recognition as a representation learning approach (bottom of figure), rather than classification (top), and demonstrates that this (1) outperforms classification-based approaches, (2) avoids re-training the models after every minor update in the assembly procedure, and (3) enables distinction between correct and erroneous states.
  • Figure 2: Illustration of the influence of the proposed loss function, which leverages images containing intermediate assembly states, i.e., non-defined, transitional states between pre-defined states. In (a), the intermediate states (gray) are ignored during training, so any information these images might contain is not used. In (b), the intermediate states are grouped into a single cluster, hindering the model's capacity to capture a meaningful embedding since these states are frequently not correlated. We propose (c), an intuitive modification to loss functions that leverages intermediate states exclusively as negative samples. Dissimilar embeddings of (potentially) uncorrelated states are only penalized if they are similar to any foreground (pre-defined) class.
  • Figure 3: Overview of the contrastive learning framework (with the proposed ISIL modification). Each mini-batch consists of real-world and synthetic images of pre-defined assembly states and real-world intermediate states. The images are passed through an encoder $f(\cdot)$, followed by a three-layer MLP projection head $g(\cdot)$. The resulting embeddings $z_i$ are used to calculate the contrastive loss. During inference, only the first layer of $g(\cdot)$ is used.
  • Figure 4: Assembly state recognition performance on IndustReal (Schoonbeek et al., 2024). The contrastive approaches outperform those trained for classification, and the proposed ISIL modification increases performance in all settings.
  • Figure 5: Performance of recognizing entirely unseen states. Contrastive losses outperform the cross-entropy loss for all settings, and the ResNet backbone demonstrates significantly better generalization than the ViT.
  • ...and 3 more figures
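
The idea described in the Figure 2 caption, option (c), can be illustrated with a minimal sketch. It assumes a SupCon-style supervised contrastive loss as the base loss, which is one possible choice and not necessarily the authors' exact formulation. The function name, the `INTERMEDIATE` label value, and the temperature of 0.1 are hypothetical choices made for this illustration. Intermediate states are never used as anchors or positives; they only enter the denominator, so they are pushed away from foreground classes without being forced into a cluster of their own.

```python
import math

INTERMEDIATE = -1  # hypothetical label marking unlabeled intermediate states


def _normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def contrastive_loss_intermediate_negatives(embeddings, labels, temperature=0.1):
    """SupCon-style loss in which intermediate samples act only as negatives.

    This is an illustrative reading of the caption, not the paper's code.
    """
    z = [_normalize(v) for v in embeddings]
    n = len(z)
    losses = []
    for i in range(n):
        if labels[i] == INTERMEDIATE:
            continue  # intermediate states are never anchors
        # Temperature-scaled similarity of the anchor to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Positives share the anchor's foreground label.
        positives = [j for j in logits if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Stable log-sum-exp over all other samples, intermediates included.
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        losses.append(-sum(logits[p] - log_denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0
```

In this sketch, an intermediate sample whose embedding lies close to a foreground cluster raises the loss more than one that lies far from every cluster, which mirrors the caption's statement that such samples are only penalized when they resemble a pre-defined class.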