Yolo-Key-6D: Single Stage Monocular 6D Pose Estimation with Keypoint Enhancements

Kemal Alperen Çetiner, Hazım Kemal Ekenel

TL;DR

This paper presents Yolo-Key-6D, a single-stage, end-to-end framework for monocular 6D pose estimation designed for both speed and accuracy. Its results show that a carefully designed single-stage method can offer a practical balance of accuracy and efficiency for real-world deployment.

Abstract

Estimating the 6D pose of objects from a single RGB image is a critical task for robotics and extended reality applications. However, state-of-the-art multi-stage methods often suffer from high latency, making them unsuitable for real-time use. In this paper, we present Yolo-Key-6D, a novel single-stage, end-to-end framework for monocular 6D pose estimation designed for both speed and accuracy. Our approach enhances a YOLO-based architecture by integrating an auxiliary head that regresses the 2D projections of an object's 3D bounding box corners. This keypoint detection task significantly improves the network's understanding of 3D geometry. For stable end-to-end training, we directly regress rotation using a continuous 9D representation that is projected onto SO(3) via singular value decomposition. On the LINEMOD and LINEMOD-Occluded benchmarks, Yolo-Key-6D achieves competitive accuracy scores of 96.24% and 69.41%, respectively, under the ADD(-S) 0.1d metric, while operating in real time. Our results demonstrate that a carefully designed single-stage method can provide a practical and effective balance of performance and efficiency for real-world deployment.
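
The two technical ingredients named in the abstract, the SVD projection of a 9D output onto SO(3) and the ADD(-S) 0.1d criterion, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`project_to_so3`, `add_error`, `is_correct`) are hypothetical, the 9D vector is assumed to be a row-major 3x3 matrix, and the ADD-S variant uses a brute-force nearest-point search that is only practical for small point sets.

```python
import numpy as np


def project_to_so3(m9):
    """Map a raw 9D vector to the closest rotation matrix (Frobenius norm)."""
    # Assumption: the 9 values are a row-major 3x3 matrix.
    M = np.asarray(m9, dtype=float).reshape(3, 3)
    U, _, Vt = np.linalg.svd(M)
    # Flip the last singular direction when U @ Vt is a reflection,
    # so that the result has determinant +1.
    d = -1.0 if np.linalg.det(U @ Vt) < 0 else 1.0
    return U @ np.diag([1.0, 1.0, d]) @ Vt


def add_error(points, R_gt, t_gt, R_pred, t_pred, symmetric=False):
    """Mean model-point distance between ground-truth and predicted poses.

    symmetric=False gives ADD (matched points);
    symmetric=True gives ADD-S (closest predicted point per ground-truth point).
    """
    gt = points @ R_gt.T + t_gt
    pred = points @ R_pred.T + t_pred
    if symmetric:
        # Brute-force pairwise distances, shape (N, N).
        pairwise = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
        return float(pairwise.min(axis=1).mean())
    return float(np.linalg.norm(gt - pred, axis=1).mean())


def is_correct(error, diameter, ratio=0.1):
    """The 0.1d rule: a pose counts as correct if error < ratio * diameter."""
    return error < ratio * diameter


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    R = project_to_so3(rng.normal(size=9))      # stand-in for a network output
    pts = rng.uniform(-0.05, 0.05, size=(50, 3))  # toy model points, in meters
    t = np.array([0.0, 0.0, 0.5])
    err = add_error(pts, R, t, R, t + np.array([0.005, 0.0, 0.0]))
    print(err, is_correct(err, diameter=0.1))
```

Because the projection returns a valid rotation for any full-rank 3x3 input, the network output needs no orthogonality constraint during training, which is the usual motivation for this representation.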

Paper Structure (21 sections, 15 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: 3D bounding boxes calculated from the predicted rotation matrices and translation vectors for various objects in the test dataset.
  • Figure 2: Augmented data samples with changed HSV values. The top-left image is the original; the remaining ones are augmented samples.
  • Figure 3: The object of interest is cut out and the background is replaced with an image from the VOC 2012 dataset. The images on the left are originals and the ones on the right are augmented.
  • Figure 4: The image is rotated around the principal axis.
  • Figure 5: YOLOv11 combines an E-ELAN backbone using group convolutions with FPN and PAN neck layers to create aggregate features. Our model produces rotation, depth, and keypoint vectors in addition to the usual detection head in order to perform the 6-DoF pose estimation task.
  • ...and 1 more figure
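
The 3D bounding boxes mentioned in the Figure 1 caption, and the 2D corner keypoints regressed by the auxiliary head, both rest on the same pinhole projection of eight box corners. A minimal sketch under stated assumptions: the box is axis-aligned and centered at the object origin, `K` is a standard 3x3 intrinsics matrix, and the helper names (`box_corners`, `project_corners`) and the numeric values are illustrative only, not taken from the paper.

```python
import numpy as np


def box_corners(extent):
    """Eight corners of an axis-aligned box centered at the object origin."""
    hx, hy, hz = np.asarray(extent, dtype=float) / 2.0
    return np.array([[sx * hx, sy * hy, sz * hz]
                     for sx in (-1, 1)
                     for sy in (-1, 1)
                     for sz in (-1, 1)])


def project_corners(corners, R, t, K):
    """Project object-frame 3D points to pixel coordinates (pinhole model)."""
    cam = corners @ R.T + t           # object frame -> camera frame
    uvw = cam @ K.T                   # apply the camera intrinsics
    return uvw[:, :2] / uvw[:, 2:3]   # perspective divide


if __name__ == "__main__":
    # Hypothetical intrinsics and pose, chosen only for illustration.
    K = np.array([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])
    R = np.eye(3)
    t = np.array([0.0, 0.0, 1.0])
    uv = project_corners(box_corners([0.2, 0.2, 0.2]), R, t, K)
    print(uv.shape)
```

Drawing the twelve box edges between the projected corners gives the kind of overlay the Figure 1 caption describes; the same corner projections computed from ground-truth poses could serve as keypoint regression targets.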