Deep Learning-Based Object Pose Estimation: A Comprehensive Survey

Jian Liu; Wei Sun; Hui Yang; Zhiwen Zeng; Chongpei Liu; Jin Zheng; Xingyu Liu; Hossein Rahmani; Nicu Sebe; Ajmal Mian

TL;DR

This survey comprehensively catalogs deep learning approaches to object pose estimation across three problem formulations: instance-level, category-level, and unseen objects. It organizes methods by input modality, output DoF, and object properties, and it surveys training paradigms, inference modes, and benchmarks, highlighting state-of-the-art trends and remaining challenges. The authors emphasize data efficiency, domain transfer, and generalization, offering concrete directions such as weak/self-supervised learning, synthetic-to-real adaptation, end-to-end deployment, and open-vocabulary generalization. By mapping datasets, metrics, and representative methods across the spectrum, the work guides practitioners in selecting suitable techniques for diverse robotics, AR/VR, and autonomous systems tasks.

Abstract

Object pose estimation is a fundamental computer vision problem with broad applications in augmented reality and robotics. Over the past decade, deep learning models, due to their superior accuracy and robustness, have increasingly supplanted conventional algorithms reliant on engineered point pair features. Nevertheless, several challenges persist in contemporary methods, including their dependency on labeled training data, model compactness, robustness under challenging conditions, and their ability to generalize to novel unseen objects. A recent survey discussing the progress made on different aspects of this area, outstanding challenges, and promising future directions, is missing. To fill this gap, we discuss the recent advances in deep learning-based object pose estimation, covering all three formulations of the problem, i.e., instance-level, category-level, and unseen object pose estimation. Our survey also covers multiple input data modalities, degrees-of-freedom of output poses, object properties, and downstream tasks, providing the readers with a holistic understanding of this field. Additionally, it discusses training paradigms of different domains, inference modes, application areas, evaluation metrics, and benchmark datasets, as well as reports the performance of current state-of-the-art methods on these benchmarks, thereby facilitating the readers in selecting the most suitable method for their application. Finally, the survey identifies key challenges, reviews the prevailing trends along with their pros and cons, and identifies promising directions for future research. We also keep tracing the latest works at https://github.com/CNJianLiu/Awesome-Object-Pose-Estimation.

Paper Structure (52 sections, 10 equations, 7 figures, 4 tables)

Introduction
Datasets and Metrics
Datasets for Instance-Level Methods
BOP Challenge Datasets
Other Datasets
Datasets for Category-Level Methods
Rigid Objects Datasets
Articulated Objects Datasets
Datasets for Unseen Methods
Metrics
3DoF Evaluation Metrics
6DoF Evaluation Metrics
9DoF Evaluation Metric
Other Metric
Instance-Level Object Pose Estimation
...and 37 more sections

Figures (7)

Figure 1: Comparison of instance-level, category-level, and unseen object methods. Instance-level methods can only estimate the pose of specific object instances on which they are trained. Category-level methods can infer intra-class unseen instances rather than being limited to specific instances in the training data. In contrast, unseen object pose estimation methods have stronger generalization ability and can handle object categories not encountered during training.
Figure 2: A taxonomy of this survey. Firstly, we review the datasets and evaluation metrics used to evaluate object pose estimation. Next, we review the deep learning-based methods by dividing them into three categories: instance-level, category-level, and unseen methods. Instance-level methods can be further classified into correspondence-based, template-based, voting-based, and regression-based methods. Category-level methods can be further divided into shape prior-based and shape prior-free methods. Unseen methods can be further classified into CAD model-based and manual reference view-based methods.
Figure 3: Chronological overview of the datasets for object pose estimation evaluation. Notably, the pink arrows represent the BOP Challenge datasets, which can be used to evaluate both instance-level and unseen object methods. The red references represent the datasets of articulated objects. From this, we can also see the development trend in the field of object pose estimation, i.e., from instance-level methods to category-level and unseen methods.
Figure 4: Illustration of the correspondence-based (Sec. "Correspondence-Based Methods"), template-based (Sec. "Template-Based Methods"), voting-based (Sec. "Voting-Based Methods"), and regression-based (Sec. "Regression-Based Methods") instance-level methods. Correspondence-based methods involve establishing correspondences between input data and a provided object CAD model. Template-based methods involve identifying the most similar template from a set of templates labeled with ground-truth object poses. Voting-based methods determine object pose through a pixel-level or point-level voting scheme. Regression-based methods aim to obtain the object pose directly from the learned features.
Figure 5: Illustration of the shape prior-based (Sec. "Shape Prior-Based Methods") and shape prior-free (Sec. "Shape Prior-Free Methods") category-level methods. The dashed arrows indicate offline training: a model is trained offline on the category-level model library to obtain shape priors. Shape prior-based methods: taking RGBD input as an example, NOCS shape alignment methods (Sec. "NOCS Shape Alignment Methods") first learn a model to predict the NOCS shape/map of the object, and then align the object point cloud with the NOCS shape/map through a non-differentiable pose solver such as the Umeyama algorithm (Umeyama) to obtain the object pose. In contrast, direct regress pose methods (Sec. "Direct Regress Pose Methods") directly regress the object pose from the extracted input features. Shape prior-free methods, on the other hand, do not regress shape priors: depth-guided geometry-aware methods (Sec. "Depth-Guided Geometry-Aware Methods") focus on perceiving the global and local geometric information of the object and leverage these 3D geometric features to estimate the object pose. Conversely, RGBD-guided semantic and geometry fusion methods (Sec. "RGBD-Guided Semantic and Geometry Fusion Methods") regress the object pose by fusing the 2D semantic and 3D geometric information of the object.
...and 2 more figures
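
The Figure 5 caption mentions aligning the observed object point cloud with a predicted NOCS shape/map using the Umeyama algorithm. The following is a minimal sketch of that alignment step only, assuming point-to-point correspondences are already given (in practice they would come from a network's NOCS prediction, and the solver is commonly wrapped in an outlier-rejection loop). The function name `umeyama_similarity` and the synthetic data are hypothetical and for illustration; this is not the implementation of any particular method covered by the survey.

```python
import numpy as np


def umeyama_similarity(src, dst):
    """Least-squares similarity transform such that dst ≈ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points, e.g. predicted
    NOCS coordinates (src) and observed camera-frame points (dst).
    Returns (s, R, t): scalar scale, 3x3 rotation, 3-vector translation.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]

    # Center both point sets.
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between target and source, then its SVD.
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)

    # Sign correction so that R is a proper rotation (det = +1), not a reflection.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n          # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_src    # optimal isotropic scale
    t = mu_dst - s * R @ mu_src
    return s, R, t


# Synthetic check: build "observed" points from a known pose and recover it.
rng = np.random.default_rng(0)
nocs_points = rng.uniform(-0.5, 0.5, size=(50, 3))  # stand-in for a NOCS map
angle = 0.3
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
s_true = 0.25
t_true = np.array([0.1, -0.2, 0.8])
observed = s_true * nocs_points @ R_true.T + t_true

s_est, R_est, t_est = umeyama_similarity(nocs_points, observed)
```

With noise-free correspondences, as in this synthetic setup, the recovered scale, rotation, and translation should match the ground-truth values up to floating-point error. The recovered scale is what gives the object size in the 9DoF setting (rotation, translation, and size).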
