DOR3D-Net: Dense Ordinal Regression Network for 3D Hand Pose Estimation

Yamin Mao, Zhihua Liu, Weiming Li, SoonYong Cho, Qiang Wang, Xiaoshuai Hao

TL;DR

This work re-formulates 3D hand pose estimation as a dense ordinal regression problem and proposes a novel Dense Ordinal Regression 3D Pose Network (DOR3D-Net), which provides significant improvements over SOTA methods.

Abstract

Depth-based 3D hand pose estimation is an important but challenging research task in the human-machine interaction community. Recently, dense regression methods have attracted increasing attention in 3D hand pose estimation because they offer low computational cost and high accuracy by densely regressing hand joint offset maps. However, large-scale regression offset values are often affected by noise and outliers, leading to a significant drop in accuracy. To tackle this, we re-formulate 3D hand pose estimation as a dense ordinal regression problem and propose a novel Dense Ordinal Regression 3D Pose Network (DOR3D-Net). Specifically, we first decompose offset value regression into sub-tasks of binary classification with ordinal constraints. Each binary classifier then predicts the probability of a binary spatial relationship relative to the joint, which is easier to train and yields a much lower level of noise. The estimated hand joint positions are inferred by aggregating the ordinal regression results at local positions with a weighted sum. Furthermore, both a joint regression loss and an ordinal regression loss are used to train DOR3D-Net in an end-to-end manner. Extensive experiments on public datasets (ICVL, MSRA, NYU and HANDS2017) show that our design provides significant improvements over SOTA methods.

Paper Structure (13 sections, 9 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Visualization of final and intermediate results comparing SOTA methods with our method. Rows $1$ and $2$ show predictions from A2J and JGR-P2O respectively, and row $3$ shows our predictions. For each row, column $1$ shows the final prediction for an exemplar hand joint (superimposed on the input depth image), and columns $2$ and $3$ show the $x$-offset and $y$-offset maps respectively. For better comparison, bright and light yellow dots surrounded by red circles show the predicted joint and the ground truth respectively. Notice that several error areas are present in the offset maps from A2J and JGR-P2O (highlighted with red and yellow boxes). In contrast, our probability map is clean. This results from our dense ordinal regression design and enables DOR3D-Net to surpass SOTA methods on public benchmarks.
  • Figure 2: The pipeline of our transformer-based feature extractor. It contains patch partition and four Swin Transformer stages. Patch partition splits the image into multiple $4\times4$ patches and each patch is considered as a token. Then tokens pass through each stage to learn long-range feature interactions through Swin Transformer blocks. The final two feature maps from the last stage are sent into the dense ordinal regression module for 3D hand pose prediction.
  • Figure 3: The pipeline of our proposed dense ordinal regression module. The inputs are two feature maps. With the reshape and softmax operators, we obtain binary probability maps. With a weighted sum, the binary probabilities at local positions are aggregated to infer hand keypoints along each of the three dimensions. Supervised by the dense ordinal regression loss, these binary classifiers are easier to train and yield a much lower level of noise, which helps to estimate accurate 3D hand joint poses.
  • Figure 4: Visualization of the proposed $x$- and $z$-discretization process. The $x$-axis uses uniform discretization and the $z$-axis uses normal discretization. For the $x$-probability map, each column represents the probability that the keypoint is larger than the corresponding discretization threshold. For the $z$-probability map, each map represents the probability that the keypoint is larger than the corresponding discretization threshold.
  • Figure 5: Comparison with state-of-the-art methods on the MSRA, ICVL, and NYU datasets. Top: the per-joint mean error over all test examples. Bottom: percentage of test frames under different error thresholds.
  • ...and 3 more figures
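
The ordinal decomposition described in the abstract and in Figures 3 and 4 can be illustrated with a small sketch. It is not the paper's implementation: the function names, the value range [-1, 1], the number of thresholds, and the simple "lower bound plus bin width times sum of probabilities" decoding rule are all assumptions chosen for illustration. The sketch encodes a coordinate as binary "is the keypoint larger than threshold k?" labels under uniform discretization, decodes such labels back to a coordinate, and combines per-position estimates with a normalized weighted sum.

```python
def uniform_thresholds(lo, hi, k):
    """Return k evenly spaced thresholds lo, lo+w, ..., hi-w and the bin width w."""
    w = (hi - lo) / k
    return [lo + i * w for i in range(k)], w


def ordinal_encode(value, thresholds):
    """Binary labels: 1.0 where value exceeds the threshold, else 0.0."""
    return [1.0 if value > t else 0.0 for t in thresholds]


def is_ordinal(probs, tol=1e-9):
    """Ordinal constraint: probabilities must not increase with the threshold."""
    return all(a + tol >= b for a, b in zip(probs, probs[1:]))


def ordinal_decode(probs, lo, width):
    """Assumed decoding rule: each exceeded threshold contributes one bin width.

    With hard 0/1 labels this returns the upper edge of the bin that
    contains the value, so the error is at most one bin width.
    """
    return lo + width * sum(probs)


def aggregate(estimates, weights):
    """Normalized weighted sum of per-position coordinate estimates."""
    total = sum(weights)
    return sum(e * w for e, w in zip(estimates, weights)) / total


if __name__ == "__main__":
    thresholds, width = uniform_thresholds(-1.0, 1.0, 8)
    labels = ordinal_encode(0.3, thresholds)
    print(labels)                               # six ones, then two zeros
    print(is_ordinal(labels))                   # True
    print(ordinal_decode(labels, -1.0, width))  # 0.5
    print(aggregate([0.5, 0.25, 0.5], [1.0, 2.0, 1.0]))  # 0.375
```

In a trained network the hard labels would be replaced by predicted probabilities, and the weights in `aggregate` would come from the network rather than being fixed by hand. Replacing the uniform thresholds with non-uniform ones, as Figure 4 describes for the $z$-axis, would require a decoding rule that accounts for varying bin widths.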