FAST GDRNPP: Improving the Speed of State-of-the-Art 6D Object Pose Estimation

Thomas Pöllabauer, Ashwin Pramod, Volker Knauthe, Michael Wahl

TL;DR

The paper addresses the need for real-time 6D pose estimation by accelerating the state-of-the-art GDRNPP model through backbone downsizing, structured pruning, and knowledge distillation. It systematically evaluates backbone options, region-based attention, and two pruning targets (the geometric head and the Patch-P$n$P module), culminating in two fast configurations (N and P) that substantially reduce latency while preserving accuracy to within a few percent on core benchmarks. Distillation provides additional gains for smaller backbones at the cost of added training complexity, while pruning the geometric head yields the largest speedups with modest losses in average recall (AR). Together, these changes deliver near real-time inference on many core datasets, enabling practical deployment in industrial tasks such as bin picking and robotic manipulation.

Abstract

6D object pose estimation involves determining the three-dimensional translation and rotation of an object within a scene, relative to a chosen coordinate system. This problem is of particular interest for practical applications in industrial tasks such as quality control, bin picking, and robotic manipulation, where both speed and accuracy are critical for real-world deployment. Current models, both classical and deep-learning-based, often struggle with the trade-off between accuracy and latency. Our research focuses on increasing the speed of a prominent state-of-the-art deep learning model, GDRNPP, while retaining its high accuracy. We employ several techniques to reduce the model size and inference time: using smaller, faster backbones, pruning unnecessary parameters, and distillation to transfer knowledge from a large, high-performing teacher model to a smaller, more efficient student model. Our findings demonstrate that the proposed configuration maintains accuracy comparable to the state of the art while significantly improving inference time. This advancement could lead to more efficient and practical applications in various industrial scenarios, thereby enhancing the overall applicability of 6D object pose estimation models in real-world settings.
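
To make the notion of a 6D pose concrete, the following minimal sketch applies a pose, written as a rotation matrix R and a translation vector t, to a few object-frame points. It is only an illustration and not part of GDRNPP: the helper names `rotation_z` and `apply_pose` and the numeric values are assumptions chosen for this example.

```python
import numpy as np

def rotation_z(angle_rad):
    """3x3 rotation matrix for a rotation about the z-axis (illustrative helper)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def apply_pose(R, t, points):
    """Map Nx3 object-frame points into the reference frame: x_ref = R @ x_obj + t."""
    return points @ R.T + t

# Hypothetical pose: 90 degree rotation about z, then a shift of (0.1, 0.0, 0.5).
R = rotation_z(np.pi / 2)
t = np.array([0.1, 0.0, 0.5])

# Two hypothetical object-frame points (unit vectors along x and y).
model_points = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0]])

camera_points = apply_pose(R, t, model_points)
print(camera_points)
```

The three rotational and three translational degrees of freedom together give the "6D" in the task name; a pose estimator such as GDRNPP predicts R and t from an image rather than receiving them as input.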

Paper Structure (23 sections, 8 equations, 14 figures, 6 tables)

Figures (14)

  • Figure 1: GDR-Net as presented in wang2021gdr: Given an RGB image, the model takes a zoomed-in Region of Interest (RoI) of the image as input (1st module). A geometric decoder head then predicts intermediate geometric feature maps (2nd module) $M_{SRA}$, $M_{vis}$, $M_{2D-3D}$. These features are fed into the Patch-P$n$P module (3rd module) to regress the rotation and translation.
  • Figure 2: Illustration of $L_1$-norm filter pruning as found in li2022pruning. $F_{i,j}$ is the filter activation on layer $x_{i}$. Removal of a filter directly propagates through the network, leading to additional filters being removed further down.
  • Figure 3: Illustration of $L_1$-norm filter pruning applied to the geometric head. The layer architecture is adjusted by $8 \times D$.
  • Figure 4: Illustration of $L_1$-norm filter pruning applied to the Patch-P$n$P module. The layer architecture is adjusted by $4 \times D$.
  • Figure 5: Performance of various backbones when integrated into GDRNPP.
  • ...and 9 more figures
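
The $L_1$-norm filter pruning named in the captions of Figures 2 to 4 can be sketched as follows. This is a generic illustration of the technique under stated assumptions, not the authors' implementation: the function name `l1_prune_conv_pair`, the `keep_ratio` parameter, and the toy weight shapes are hypothetical. It ranks the filters of one convolution by the sum of their absolute weights, keeps the strongest ones, and removes the matching input channels from the following convolution, which is how the removal of a filter propagates to the next layer.

```python
import numpy as np

def l1_prune_conv_pair(w_cur, w_next, keep_ratio):
    """Prune the filters of one conv layer by L1 norm (illustrative sketch).

    w_cur:  weights of the current layer, shape (out_ch, in_ch, kh, kw)
    w_next: weights of the following layer, shape (out_ch2, out_ch, kh2, kw2)
    keep_ratio: fraction of the current layer's filters to keep (assumed knob)
    """
    n_keep = max(1, int(round(w_cur.shape[0] * keep_ratio)))
    # L1 norm of each filter: sum of absolute weights over (in_ch, kh, kw).
    l1 = np.abs(w_cur).sum(axis=(1, 2, 3))
    # Indices of the n_keep filters with the largest L1 norm, in original order.
    keep = np.sort(np.argsort(l1)[-n_keep:])
    # Dropping a filter removes its output feature map, so the next layer
    # loses the corresponding input channel as well.
    return w_cur[keep], w_next[:, keep], keep

# Toy weights: four filters whose magnitudes are scaled by 3, 1, 4 and 2.
scales = np.array([3.0, 1.0, 4.0, 2.0]).reshape(4, 1, 1, 1)
w_cur = np.ones((4, 2, 3, 3)) * scales
w_next = np.ones((5, 4, 3, 3))

pruned_cur, pruned_next, kept = l1_prune_conv_pair(w_cur, w_next, keep_ratio=0.5)
print(kept, pruned_cur.shape, pruned_next.shape)
```

In practice a pruned network is usually fine-tuned afterwards to recover accuracy; that step is omitted here.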