Detection and Pose Estimation of flat, Texture-less Industry Objects on HoloLens using synthetic Training

Thomas Pöllabauer, Fabian Rücker, Andreas Franek, Felix Gorschlüter

TL;DR

The paper tackles the problem of real-time 6D pose estimation for flat, texture-less industrial objects on edge devices by leveraging synthetic training data derived from manufacturing documents. It presents a client-server AR pipeline that uses YOLOv5 for detection and CosyPose for pose estimation, trained entirely on synthetic renders generated from 2D manufacturing schematics converted into 3D meshes. The approach achieves strong detection recall and competitive pose estimation on real HoloLens 2 data, while acknowledging latency from backend processing and the domain gap between synthetic and real imagery. It also provides a modular framework with clear avenues for improving on-device inference, incorporating frontend tracking, and expanding the dataset to better handle challenging objects and motion conditions.

Abstract

Current state-of-the-art 6D pose estimation is too compute-intensive to be deployed on edge devices such as Microsoft HoloLens (2) or Apple iPad, both of which are used for an increasing number of augmented reality (AR) applications. The quality of AR depends greatly on its ability to detect and overlay geometry within the scene. We propose a synthetically trained, client-server-based augmented reality application, demonstrating state-of-the-art object pose estimation of metallic and texture-less industry objects on edge devices. Synthetic data enables training without real photographs, e.g. for yet-to-be-manufactured objects. Our qualitative evaluation on an AR-assisted sorting task and our quantitative evaluation on both renderings and real-world data recorded on HoloLens 2 shed light on its real-world applicability.

Paper Structure (21 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Functional blocks in our approach. First, we extract the shapes (curves) from the manufacturing documents (Splines). Next, we create meshes from the extracted shapes (Geometry Generation) and use them to create our physically based, photo-realistic training dataset as well as our non-photorealistic dataset (Rendering). Combining both datasets, we train our object detector and pose estimation pipeline. At inference, given a real-world camera stream, we process the data image by image and obtain per-image detections and 6D pose vectors, which are displayed on either HoloLens 2 or iPad.
  • Figure 2: Samples of our synthetic training data. Note that we make very few assumptions about material or lighting: we randomly choose the coefficients for specular and diffuse reflection, apply random textures, and use a random number of randomly placed light sources. The effect of mosaic augmentation (as proposed in YOLOv4) is also visible.
  • Figure 3: Qualitative results of our end-to-end pipeline on HoloLens 2. Note that we do not use tracking at this point, only single-shot results. We do, however, place the colored objects with respect to the real-world coordinate system, so they stay in place when the user moves. This makes our solution "real-time capable" although only 3-4 images are processed in the backend. https://www.dropbox.com/s/gasisytmtuvi1sa/TeaserARSorting_v2.mp4?dl=0.
  • Figure 4: Impression of our detection results. We achieve high recall and good classification (see Figure 5 for quantified results). We note two problems: First, at very flat viewing angles, performance drops drastically (rightmost image). Second, some camera movements cause strong motion blur, and the detector then finds almost nothing or nothing at all. https://www.dropbox.com/s/03gvn69vdrs53zs/Qualitative%20detection%20results2.mp4?dl=0.
  • Figure 5: Precision-recall curves per object and confusion matrices of our objects. As seen in Figure \ref{fig:ObjectDetectionResults}, for most objects the network learns a meaningful representation in only a few epochs, but still improves even close to 500 epochs. Especially the worst performers, such as obj14, improve significantly between epoch 80 and epoch 500. The confusion matrices show that our model usually errs on the side of not detecting an object occurrence (false negative) and much less so on providing erroneous detections (false positives). There are hardly any mix-ups between objects (errors in object class). Evaluated on real-world data.
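
The Figure 3 caption notes that overlays are placed with respect to the real-world coordinate system, so they stay in place while the backend processes only a few images. The page does not say how this is implemented. The following is a minimal sketch of one standard way to do it, assuming the headset pose at capture time and the estimated object pose in the camera frame are both available as 4x4 rigid transforms. All function names and numeric values are hypothetical and are not taken from the paper or from any HoloLens API.

```python
import math


def matmul4(a, b):
    """Multiply two 4x4 matrices given as row-major nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def make_pose(yaw_rad, translation):
    """Build a 4x4 rigid transform: rotation about the z-axis, then translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def anchor_in_world(world_from_camera, camera_from_object):
    """Compose the camera pose at capture time with the estimated object pose.

    The result expresses the object in world coordinates, so it stays valid
    after the camera moves and no new estimate has arrived yet.
    """
    return matmul4(world_from_camera, camera_from_object)


# Hypothetical example values (assumptions, not measurements from the paper):
# the camera sits at (1.0, 0.0, 1.5) in the world, rotated 90 degrees about z,
# and the backend reports the object 0.5 m along the camera's x-axis.
world_from_camera = make_pose(math.pi / 2, (1.0, 0.0, 1.5))
camera_from_object = make_pose(0.0, (0.5, 0.0, 0.0))

world_from_object = anchor_in_world(world_from_camera, camera_from_object)
object_position = [row[3] for row in world_from_object[:3]]
print(object_position)
```

Because the stored pose is expressed in world coordinates, the client can re-render the overlay every frame from its own head tracking, independent of how often the backend returns a new estimate.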