Good Grasps Only: A data engine for self-supervised fine-tuning of pose estimation using grasp poses for verification

Frederik Hagelskjær

TL;DR

This work introduces a data engine for online, self-supervised fine-tuning of robot pose estimation in bin-picking. It fuses zero-shot pose estimation (KeyMatchNet) with in-hand pose verification to automatically generate labeled data during task execution, enabling continuous improvement without a separate training phase. Experiments on four cylindrical objects show that the self-supervised loop outperforms a CAD-trained baseline and generalizes to unseen objects, while maintaining robust grasping and enabling improved insertion. The approach reduces setup time and offers a practical path toward adaptable, self-tuning robotic manipulation in flexible manufacturing settings.

Abstract

In this paper, we present a novel method for self-supervised fine-tuning of pose estimation. Leveraging zero-shot pose estimation, our approach enables the robot to automatically obtain training data without manual labeling. After pose estimation, the object is grasped, and in-hand pose estimation is used to validate the data. Our pipeline allows the system to fine-tune while the process is running, removing the need for a separate learning phase. The motivation behind our work is the need for rapid setup of pose estimation solutions. Specifically, we address the challenging task of bin picking, which plays a pivotal role in flexible robotic setups. Our method is implemented in a robotic workcell and tested with four different objects. For all objects, our method increases performance and outperforms a state-of-the-art method trained on the CAD models of the objects. The project page is available at gogoengine.github.io.
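
The verification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the predicted pose and the in-hand pose have already been expressed in a common frame, that a pose is a pair of a 3x3 rotation matrix and a translation in meters, and that a sample is kept when both rotation and translation errors are below thresholds. The function names and the threshold values (5 degrees, 5 mm) are hypothetical, and the sketch ignores object symmetries, which a real system for symmetric parts would have to handle.

```python
import math


def rotation_angle_deg(R_a, R_b):
    """Angle (degrees) of the relative rotation between two 3x3 rotation matrices."""
    # trace(R_a^T R_b) equals the sum of element-wise products of R_a and R_b.
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def poses_agree(predicted, in_hand, max_angle_deg=5.0, max_dist_m=0.005):
    """Return True if two (R, t) poses agree within the (hypothetical) thresholds."""
    R_p, t_p = predicted
    R_h, t_h = in_hand
    angle_ok = rotation_angle_deg(R_p, R_h) <= max_angle_deg
    dist_ok = math.dist(t_p, t_h) <= max_dist_m
    return angle_ok and dist_ok


def select_verified(samples):
    """Keep only samples whose predicted pose is confirmed by the in-hand pose."""
    return [s for s in samples if poses_agree(s["predicted"], s["in_hand"])]


def rot_z(deg):
    """Helper for the example: rotation about the z-axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Illustrative data only: one consistent sample and two inconsistent ones.
origin = (0.0, 0.0, 0.0)
samples = [
    {"predicted": (rot_z(0.0), origin), "in_hand": (rot_z(2.0), (0.001, 0.0, 0.0))},
    {"predicted": (rot_z(0.0), origin), "in_hand": (rot_z(20.0), origin)},
    {"predicted": (rot_z(0.0), origin), "in_hand": (rot_z(0.0), (0.02, 0.0, 0.0))},
]
verified = select_verified(samples)
```

In a loop of the kind the abstract describes, the samples in `verified` would be the ones added to the fine-tuning set, while rejected samples would be discarded.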

Paper Structure (20 sections, 3 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: The pipeline of our data engine. First, a zero-shot pose estimation is performed. Then the object is grasped and an in-hand pose estimation is performed. By comparing the two poses, correct pose estimations are selected, and the network is fine-tuned on them. A new pose estimation is then performed and the process repeats. As data is gradually collected, the network performance increases.
  • Figure 2: The workcell used for experiments. The real workcell is shown in the first panel, with a digital twin shown in the second. The digital twin allows for planning collision-free movements, as the robot movements depend on the estimated object poses. A top view is shown in the third panel. The object bins are placed in the center. The background light for the in-hand pose estimation is located at the bottom left, and the fixture is shown at the top right.
  • Figure 3: Insertion of Novo A into the fixture.
  • Figure 4: Examples of different grasps as seen from the in-hand vision system. Images a) and b) were successfully inserted, while c) and d) could not be inserted.
  • Figure 5: The layout of the database used for collecting the data. The database structure mimics the flow of the task, allowing for a simple relationship between pose estimations and the resulting insertions.
  • ...and 2 more figures