ASDF: Assembly State Detection Utilizing Late Fusion by Integrating 6D Pose Estimation

Hannah Schieber, Shiyu Li, Niklas Corell, Philipp Beckerle, Julian Kreimeier, Daniel Roth

TL;DR

ASDF addresses the challenge of real-time assembly guidance by unifying 6D pose estimation with assembly state detection through a late-fusion Pose2State module. Built on YOLOv8Pose, it refines the estimated translation using depth data and incorporates pose-derived state cues. The model is trained on the fully synthetic ASDF dataset and evaluated on real-world data. Results show that the joint pose-state approach improves both state detection (F1) and 6D pose accuracy (ADD/ADD-S and translation error), outperforming baselines on the ASDF and GBOT datasets. The work demonstrates the practical potential of integrated pose and state reasoning for robust in-situ AR guidance in medical and industrial assembly tasks.
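The pose metrics named above (ADD, ADD-S, and translation error) follow their standard definitions, which can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' evaluation code: the function names are hypothetical, poses are assumed to be given as a 3x3 rotation matrix and a 3-vector translation, and the brute-force nearest-neighbor search in ADD-S is only practical for small point sets.

```python
import numpy as np


def transform(points, R, t):
    """Apply a rigid transform (R, t) to an (N, 3) array of model points."""
    return np.asarray(points, dtype=float) @ np.asarray(R, dtype=float).T + np.asarray(t, dtype=float)


def add_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding model points under both poses."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    return float(np.linalg.norm(gt - pred, axis=1).mean())


def add_s_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean distance from each ground-truth point to its closest predicted point.

    Used for symmetric objects, where point correspondences are ambiguous.
    """
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    # Full (N, N) distance matrix; a KD-tree would scale better for dense models.
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())


def translation_error(t_gt, t_pred):
    """Euclidean distance between ground-truth and predicted translation."""
    return float(np.linalg.norm(np.asarray(t_gt, dtype=float) - np.asarray(t_pred, dtype=float)))


if __name__ == "__main__":
    # Corners of a square: symmetric under a 90 degree rotation about z.
    square = np.array([[1, 1, 0], [1, -1, 0], [-1, -1, 0], [-1, 1, 0]], dtype=float)
    identity = np.eye(3)
    rot_z_90 = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
    zero = np.zeros(3)
    print("ADD:  ", add_metric(square, identity, zero, rot_z_90, zero))
    print("ADD-S:", add_s_metric(square, identity, zero, rot_z_90, zero))
```

The square example shows why both metrics are reported: a rotation that maps a symmetric object onto itself is penalized by ADD but not by ADD-S.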

Abstract

In medical and industrial domains, providing guidance for assembly processes can be critical to ensure efficiency and safety. Errors in assembly can lead to significant consequences such as extended surgery times and prolonged manufacturing or maintenance times in industry. Assembly scenarios can benefit from in-situ augmented reality visualization, i.e., augmentations in close proximity to the target object, to provide guidance, reduce assembly times, and minimize errors. To enable in-situ visualization, 6D pose estimation can be leveraged to identify the correct location for an augmentation. Existing 6D pose estimation techniques primarily focus on individual objects and static captures. However, assembly scenarios are dynamic: objects are occluded during assembly, and the appearance of the assembled objects changes from state to state. Existing work focuses either on object detection combined with state detection or purely on pose estimation. To address the challenges of 6D pose estimation in combination with assembly state detection, our approach ASDF builds upon the strengths of YOLOv8, a real-time capable object detection framework. We extend this framework, refine the object pose, and fuse pose knowledge with network-detected pose information. Utilizing our late fusion in our Pose2State module results in refined 6D pose estimation and assembly state detection. By combining both pose and state information, our Pose2State module predicts the final assembly state with precision. The evaluation on our ASDF dataset shows that our Pose2State module leads to improved assembly state detection and that the improved assembly state in turn leads to more robust 6D pose estimation. Moreover, on the GBOT dataset, we outperform the pure deep learning-based network and even outperform the hybrid and pure tracking-based approaches.

Paper Structure (31 sections, 6 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Architecture of ASDF. We highlight our contribution in green. ASDF utilizes RGB and depth data. The RGB images are fed into the image backbone and the depth data is used to refine the object poses ($[R|t]$). The image backbone predicts the state ($s$) based on the RGB image. In the Translation Refinement module, the translation offset is calculated. Using the relative pose between the assemblies in the assembly group, we predict a second state assumption in our Pose-based Assembly Detection module. In our final Pose2State module, we weight the individual state predictions to predict the one with the highest probability.
  • Figure 2: Assembly state complexity of the ASDF dataset (left) and training images of the ASDF dataset (right). For training we use assembled and unassembled data (right) and additionally provide hand occlusion (top-right), varying backgrounds and lighting conditions, as well as distracting objects. We include the state information in our ground truth labels. An example of the state complexity can be seen in the left figure.
  • Figure 3: Example images with highlighted ground truth from our ASDF test set. Synthetic image (left) and real-world image (right). The ground truth of each currently evaluated assembly group is visualized with colorful overlays.
  • Figure 4: Example results on the ASDF test set. We show the performance of ASDF compared to YOLOv8Pose + Assembly Detection on real-world captures (top two rows) and synthetic renderings (bottom row). Shown are the ground truth (left), the pure YOLOv8-based pose and state prediction (center), and our prediction using ASDF (right). The currently predicted state is denoted in the top-left corner of each image and the pose is shown with a colorful overlay.
  • Figure 5: Example comparison of the translation offset using YOLOv8Pose and ASDF. YOLOv8Pose shows an offset in 3D (left, yellow), while the translation refinement of ASDF (right, blue) can address this shift.
  • ...and 1 more figures
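
The Figure 1 caption describes two state predictions, one from the image backbone and one derived from the relative pose between assemblies, which Pose2State weights before selecting the most probable state. The sketch below illustrates that general idea of late fusion only. It is not the authors' implementation: the function names, the fixed fusion weights, the use of relative translation alone, the per-state reference offsets, and the softmax scale are all assumptions made for illustration.

```python
import numpy as np


def relative_translation(R_a, t_a, t_b):
    """Translation of part b expressed in the coordinate frame of part a."""
    return np.asarray(R_a, dtype=float).T @ (np.asarray(t_b, dtype=float) - np.asarray(t_a, dtype=float))


def pose_state_probs(rel_t, expected_rel_t, scale=0.01):
    """Hypothetical pose-based state scores.

    Each state has an assumed reference offset (one row of expected_rel_t, in
    meters). States whose reference is closer to the observed offset receive a
    higher probability via a softmax over negative distances.
    """
    dists = np.linalg.norm(np.asarray(expected_rel_t, dtype=float) - rel_t, axis=1)
    logits = -dists / scale
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def fuse_states(p_image, p_pose, w_image=0.5, w_pose=0.5):
    """Weighted late fusion of two per-state probability vectors.

    The weights are assumed constants here. Returns the index of the most
    probable state and the normalized fused distribution.
    """
    p_image = np.asarray(p_image, dtype=float)
    p_pose = np.asarray(p_pose, dtype=float)
    if p_image.shape != p_pose.shape:
        raise ValueError("both predictions must cover the same states")
    fused = w_image * p_image + w_pose * p_pose
    total = fused.sum()
    if total <= 0:
        raise ValueError("fused scores must have a positive sum")
    fused = fused / total
    return int(np.argmax(fused)), fused


if __name__ == "__main__":
    # Two states: 0 = parts apart (20 cm offset), 1 = parts joined (5 cm offset).
    expected = np.array([[0.0, 0.0, 0.20], [0.0, 0.0, 0.05]])
    rel_t = relative_translation(np.eye(3), [0.0, 0.0, 0.0], [0.0, 0.0, 0.06])
    p_pose = pose_state_probs(rel_t, expected)
    p_image = np.array([0.6, 0.4])  # image branch slightly prefers state 0
    state, fused = fuse_states(p_image, p_pose)
    print("fused state:", state, "distribution:", fused)
```

In this toy setup the image branch alone would pick state 0, while the pose cue pulls the fused decision toward state 1, which is the kind of correction a late-fusion step is meant to provide.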