EfficientPose: An efficient, accurate and scalable end-to-end 6D multi object pose estimation approach

Yannick Bukschat, Marcus Vetter

TL;DR

EfficientPose tackles real-time 6D pose estimation for multiple objects in RGB images by extending EfficientDet with rotation and translation subnetworks and introducing a 6D augmentation to improve generalization. It enables end-to-end, single-shot multi-object pose estimation without per-object PnP or RANSAC post-processing, while remaining scalable via a single hyperparameter $\phi$. On Linemod, it achieves state-of-the-art ADD(-S) accuracy alongside real-time performance (over 27 FPS), and it demonstrates robust multi-object capability on the Occlusion dataset. This work narrows the gap between direct 6D pose estimation and 2D+PnP pipelines, delivering practical impact for robotics, autonomous systems, and augmented reality.

Abstract

In this paper we introduce EfficientPose, a new approach for 6D object pose estimation. Our method is highly accurate, efficient and scalable over a wide range of computational resources. Moreover, it can detect the 2D bounding boxes of multiple objects and instances and estimate their full 6D poses in a single shot. This eliminates the significant increase in runtime that other approaches suffer from when dealing with multiple objects. Those approaches first detect 2D targets, e.g. keypoints, and afterwards solve a Perspective-n-Point problem for the 6D pose of each object. We also propose a novel augmentation method for direct 6D pose estimation approaches, called 6D augmentation, to improve performance and generalization. Our approach achieves a new state-of-the-art accuracy of 97.35% in terms of the ADD(-S) metric on the widely used 6D pose estimation benchmark dataset Linemod using RGB input, while still running end-to-end at over 27 FPS. Through the inherent handling of multiple objects and instances and the fused single-shot 2D object detection and 6D pose estimation, our approach runs end-to-end at over 26 FPS even with multiple objects (eight), making it highly attractive for many real-world scenarios. Code will be made publicly available at https://github.com/ybkscht/EfficientPose.
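
The ADD(-S) figure quoted in the abstract can be made concrete with a small sketch. The code below is not taken from the EfficientPose repository; it illustrates the metric as it is commonly defined for Linemod. ADD averages the distance between model points transformed by the ground-truth and the estimated pose, and ADD-S (used for symmetric objects) averages the distance to the closest transformed point instead. The function names and the 10% of object diameter acceptance threshold are assumptions made for illustration.

```python
import numpy as np


def transform(points, R, t):
    """Apply the rigid transform x -> R x + t to an (N, 3) array of points."""
    return points @ R.T + t


def add_error(points, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between corresponding transformed model points."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    return float(np.linalg.norm(gt - est, axis=1).mean())


def add_s_error(points, R_gt, t_gt, R_est, t_est):
    """ADD-S: mean distance from each ground-truth point to its closest estimated point."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    # Pairwise distance matrix of shape (N, N); fine for small point sets.
    pairwise = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return float(pairwise.min(axis=1).mean())


def is_pose_correct(error, diameter, threshold=0.1):
    """Assumed acceptance rule: error below a fraction of the object diameter."""
    return error < threshold * diameter
```

For a point set that is symmetric under a 180 degree rotation, `add_error` reports a large error for the rotated pose while `add_s_error` reports zero, which is why the symmetric variant is used for such objects. For large models the pairwise matrix becomes expensive, and a nearest-neighbour structure such as a k-d tree would be the usual replacement.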

Paper Structure

This paper contains 23 sections, 17 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Top: Example prediction for qualitative evaluation of our $\phi = 0$ model performing single shot 6D multi object pose estimation on the Occlusion test set while running end-to-end at over 26 FPS. Green 3D bounding boxes visualize ground truth poses while our estimated poses are represented by the other colors. Bottom: Average end-to-end runtimes in FPS of our $\phi = 0$ and $\phi = 3$ model on the Occlusion test set w.r.t. the number of objects per image. Shaded areas represent the standard deviations.
  • Figure 2: Schematic representation of our EfficientPose architecture including the EfficientNet backbone, the bidirectional feature pyramid network (BiFPN) and the prediction subnetworks.
  • Figure 3: Rotation network architecture with the initial regression and iterative refinement module. Each conv block consists of a depthwise separable convolution layer followed by group normalization and SiLU activation.
  • Figure 4: Architecture of the rotation refinement module. Each conv block consists of a depthwise separable convolution layer followed by group normalization and SiLU activation.
  • Figure 5: Illustration of the 2D center point estimation process. The target for each point in the feature map is the offset from the current location to the object's center point.
  • ...and 4 more figures
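
The center point targets described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the single feature-map stride, the half-cell location convention and the use of raw pixel offsets (without any normalization) are all hypothetical choices made to show the idea that each feature-map location regresses the offset to the object's 2D center.

```python
import numpy as np


def center_offset_targets(feat_h, feat_w, stride, center_xy):
    """Return a (feat_h, feat_w, 2) array of (dx, dy) pixel offsets to the object center."""
    # Assumed convention: each feature-map cell sits at the middle of its stride window.
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    grid_x, grid_y = np.meshgrid(xs, ys)  # both have shape (feat_h, feat_w)
    locations = np.stack([grid_x, grid_y], axis=-1)
    return np.asarray(center_xy, dtype=float) - locations


def recover_center(offsets, feat_y, feat_x, stride):
    """Invert the target: location of one cell plus its offset gives the center."""
    location = (np.array([feat_x, feat_y], dtype=float) + 0.5) * stride
    return location + offsets[feat_y, feat_x]
```

With this encoding every location votes for the same center, so `recover_center` returns the original `center_xy` regardless of which cell is queried.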