Learning Embeddings with Centroid Triplet Loss for Object Identification in Robotic Grasping

Anas Gouda, Max Schwarz, Christopher Reining, Sven Behnke, Alice Kirchheim

TL;DR

The paper tackles identification of unseen objects in robotic grasping when the number of input images varies. It learns embeddings with the centroid triplet loss (CTL), which aggregates all views of an object into a single centroid. The model is trained on ARMBench and accepts arbitrary numbers of query and gallery images. The loss is defined as $L_{\text{triplet}} = \max\left(\lVert C_a - C_p \rVert_2 - \lVert C_a - C_n \rVert_2 + \alpha,\ 0\right)$, where $C_a$, $C_p$, and $C_n$ are the anchor, positive, and negative centroids and $\alpha$ is the margin; it enforces separation between centroids of different objects. The method is combined with a zero-shot segmentation module (SAM) that produces object segments, which are then matched to gallery images, forming a complete pipeline for segmenting and identifying unseen objects. Experiments show state-of-the-art performance on ARMBench and competitive results on HOPE, which points to practical applicability in warehouse automation and robust recognition of unseen instances. The work provides a scalable, multi-image, appearance-robust approach to object identification that can support downstream 2D segmentation and 6D localization in real-world robotic systems.
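
As a rough illustration of the loss above, the following NumPy sketch computes it for a single triplet of image sets. The function names, the toy 2-D features, and the margin value `alpha=0.3` are assumptions made for this example and are not taken from the paper; in practice the features would come from the trained backbone.

```python
import numpy as np

def centroid(features):
    """Mean of an (n_images, dim) feature array: one centroid per image set."""
    return np.asarray(features, dtype=float).mean(axis=0)

def centroid_triplet_loss(anchor, positive, negative, alpha=0.3):
    """max(||C_a - C_p||_2 - ||C_a - C_n||_2 + alpha, 0) for three image sets.

    alpha=0.3 is an illustrative margin, not a value from the paper.
    """
    c_a, c_p, c_n = centroid(anchor), centroid(positive), centroid(negative)
    d_pos = np.linalg.norm(c_a - c_p)  # distance to the same-object centroid
    d_neg = np.linalg.norm(c_a - c_n)  # distance to the other-object centroid
    return max(d_pos - d_neg + alpha, 0.0)

# Toy features; the set sizes deliberately differ (2, 3, and 1 images).
anchor = np.array([[0.0, 0.0], [2.0, 0.0]])                  # centroid (1, 0)
positive = np.array([[1.0, 1.0], [1.0, 0.0], [1.0, -1.0]])   # centroid (1, 0)
negative = np.array([[1.0, 4.0]])                            # centroid (1, 4)

loss_easy = centroid_triplet_loss(anchor, positive, negative)  # margin satisfied
loss_hard = centroid_triplet_loss(anchor, negative, positive)  # roles swapped
```

Because only the centroids enter the loss, the three sets can hold different numbers of images, which is the input-size flexibility the summary refers to.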

Abstract

Foundation models are a strong trend in deep learning and computer vision. These models serve as a base for applications, as they require little or no further fine-tuning by developers to integrate into their applications. Foundation models for zero-shot object segmentation such as Segment Anything (SAM) output segmentation masks from images without any further object information. When they are followed in a pipeline by an object identification model, they can perform object detection without training. Here, we focus on training such an object identification model. A crucial practical aspect for an object identification model is to be flexible in input size. As object identification is an image retrieval problem, a suitable method should handle multi-query multi-gallery situations without constraining the number of input images (e.g., by having fixed-size aggregation layers). The key solution to train such a model is the centroid triplet loss (CTL), which aggregates image features to their centroids. CTL yields high accuracy, avoids misleading training signals, and keeps the model input size flexible. In our experiments, we establish a new state of the art on the ARMBench object identification task, which shows general applicability of our model. We furthermore demonstrate an integrated unseen object detection pipeline on the challenging HOPE dataset, which requires fine-grained detection. There, our pipeline matches or surpasses related methods which have been trained on dataset-specific data.
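
To make the multi-query multi-gallery idea concrete, here is a minimal sketch of centroid-based matching. It assumes features have already been extracted, uses cosine similarity with a rejection threshold as described in the caption of Figure 3, and introduces hypothetical names (`identify`, `l2_normalize`) and an arbitrary threshold of 0.5 that do not come from the paper.

```python
import numpy as np

def l2_normalize(v):
    """Scale vectors to unit length along the last axis (assumes non-zero norm)."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def identify(query_feats, gallery, threshold=0.5):
    """Match a set of query features to the closest gallery object.

    query_feats: (n_query, dim) array, any n_query >= 1.
    gallery: dict mapping object id -> (n_i, dim) array, any n_i >= 1.
    Returns (object_id or None, best cosine similarity).
    threshold=0.5 is an illustrative value, not one from the paper.
    """
    q = l2_normalize(np.asarray(query_feats, dtype=float).mean(axis=0))
    ids = list(gallery)
    g = np.stack([np.asarray(gallery[i], dtype=float).mean(axis=0) for i in ids])
    sims = l2_normalize(g) @ q            # cosine similarity per gallery object
    best = int(np.argmax(sims))
    if sims[best] < threshold:            # reject badly matched queries
        return None, float(sims[best])
    return ids[best], float(sims[best])

# Toy gallery with a different number of images per object.
gallery = {
    "mug": np.array([[1.0, 0.0], [0.8, 0.2]]),
    "box": np.array([[0.0, 1.0], [0.1, 0.9], [0.0, 1.0]]),
}
matched_id, matched_sim = identify(np.array([[0.9, 0.1]]), gallery)
rejected_id, rejected_sim = identify(np.array([[-1.0, -1.0]]), gallery)
```

Averaging before comparison means no layer in the model depends on how many query or gallery images are supplied, so the same function handles one image or many.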

Paper Structure

This paper contains 9 sections, 2 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of our method. We train an object identification backbone on the large-scale ARMBench ID dataset using the centroid triplet loss (CTL) [unreasonable_ctl], such that the backbone learns to associate query images of objects in cluttered containers to matching gallery images. Crucially, the CTL operates on centroids in feature space, allowing the aggregation of an arbitrary number of input images. The trained backbone can then be used for identification of unseen objects that were segmented using a generic object segmentation method such as Segment Anything (SAM), given corresponding gallery images.
  • Figure 2: Typical stages for handling unseen objects. Zero-shot segmentation and object identification/matching represent by themselves a pipeline for 2D segmentation of unseen objects. Another stage is then optionally added to perform 6D pose estimation of unseen objects.
  • Figure 3: Full unseen object detector pipeline. A zero-shot segmentation method removes the background and produces object segments, which may be over- or undersegmented and can overlap. Features are extracted by the identification backbone on both gallery images and segments. After finding closest matches in feature space using cosine similarity, badly or un-matched items are rejected by a thresholding operation. Finally, the matches can be used to construct a full instance segmentation of the scene.
  • Figure 4: Example objects from ARMBench that have gallery images differing largely in texture. Treating each image on its own as a possible match loses a valuable training signal: these images belong together. Using CTL, we treat objects as a whole.
  • Figure 5: Batch computations during training. Each batch is filled with triplets until the batch size is reached. Backbone feature vectors are extracted in batched fashion. The resulting features are then aggregated to their corresponding centroid using an index-add operation. Finally, the losses are applied.
  • ...and 2 more figures
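
The batch computation described in the caption of Figure 5 can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' implementation: `np.add.at` stands in for the index-add operation, the centroid rows are assumed to be laid out as consecutive (anchor, positive, negative) triplets, the per-triplet losses are averaged, and the margin `alpha=0.3` is arbitrary.

```python
import numpy as np

def aggregate_centroids(features, centroid_index, num_centroids):
    """Average per-image features into centroids with an index-add.

    Assumes every centroid index in [0, num_centroids) occurs at least once.
    """
    sums = np.zeros((num_centroids, features.shape[1]))
    np.add.at(sums, centroid_index, features)   # unbuffered scatter-add
    counts = np.bincount(centroid_index, minlength=num_centroids)
    return sums / counts[:, None]

def batched_ctl(centroids, alpha=0.3):
    """Mean triplet loss over centroids laid out as [a0, p0, n0, a1, p1, n1, ...]."""
    trip = centroids.reshape(-1, 3, centroids.shape[1])
    d_pos = np.linalg.norm(trip[:, 0] - trip[:, 1], axis=1)
    d_neg = np.linalg.norm(trip[:, 0] - trip[:, 2], axis=1)
    return float(np.maximum(d_pos - d_neg + alpha, 0.0).mean())

# Ten toy "backbone features" forming two triplets with varying set sizes.
features = np.array([
    [0.0, 0.0], [2.0, 0.0],                 # triplet 0 anchor    -> (1, 0)
    [1.0, 1.0], [1.0, 0.0], [1.0, -1.0],    # triplet 0 positive  -> (1, 0)
    [1.0, 4.0],                             # triplet 0 negative  -> (1, 4)
    [0.0, 0.0],                             # triplet 1 anchor    -> (0, 0)
    [3.0, 4.0],                             # triplet 1 positive  -> (3, 4)
    [0.0, 1.0], [0.0, 1.0],                 # triplet 1 negative  -> (0, 1)
])
centroid_index = np.array([0, 0, 1, 1, 1, 2, 3, 4, 5, 5])

centroids = aggregate_centroids(features, centroid_index, num_centroids=6)
batch_loss = batched_ctl(centroids)
```

All images pass through the backbone in a single batch, and only the index array records which centroid each feature belongs to, so sets of different sizes need no padding.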