HandDAGT: A Denoising Adaptive Graph Transformer for 3D Hand Pose Estimation

Wencan Cheng, Eunji Kim, Jong Hwan Ko

TL;DR

HandDAGT is a denoising adaptive graph Transformer for robust 3D hand pose estimation under severe occlusion. It fuses depth-image and point-cloud features into super points, then applies an adaptive attention mechanism that weighs local geometry against kinematic topology for each keypoint, so the model adjusts to the occlusion present in the input. A denoising training strategy further improves robustness by perturbing the local 3D patches with random noise during training. Across four challenging datasets, HandDAGT achieves state-of-the-art mean keypoint errors, which makes it promising for occlusion-heavy hand interactions in HCI and AR/VR applications.

Abstract

The extraction of keypoint positions from input hand frames, known as 3D hand pose estimation, is crucial for various human-computer interaction applications. However, current approaches often struggle with the dynamic nature of self-occlusion of hands and intra-occlusion with interacting objects. To address this challenge, this paper proposes the Denoising Adaptive Graph Transformer, HandDAGT, for hand pose estimation. The proposed HandDAGT leverages a transformer structure to thoroughly explore effective geometric features from input patches. Additionally, it incorporates a novel attention mechanism to adaptively weigh the contribution of kinematic correspondence and local geometric features for the estimation of specific keypoints. This attribute enables the model to adaptively employ kinematic and local information based on the occlusion situation, enhancing its robustness and accuracy. Furthermore, we introduce a novel denoising training strategy aimed at improving the model's robustness in the face of occlusion challenges. Experimental results show that the proposed model significantly outperforms existing methods on four challenging hand pose benchmark datasets. Code and pre-trained models are publicly available at https://github.com/cwc1260/HandDAGT.
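
The idea of adaptively weighing kinematic and local information per keypoint can be illustrated with a small NumPy sketch. This is not the authors' attention mechanism: it assumes each keypoint already has one kinematic feature vector and one local geometric feature vector, and it uses a hypothetical projection `w_gate` to produce two softmax weights per keypoint. The names `adaptive_fuse`, `kin_feat`, `loc_feat`, and `w_gate` are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fuse(kin_feat, loc_feat, w_gate):
    """Blend kinematic and local features with per-keypoint weights.

    kin_feat, loc_feat: (J, C) features for J keypoints (assumed inputs).
    w_gate: (2C, 2) hypothetical projection producing two logits per keypoint.
    Returns the fused (J, C) features and the (J, 2) weights.
    """
    logits = np.concatenate([kin_feat, loc_feat], axis=-1) @ w_gate  # (J, 2)
    weights = softmax(logits, axis=-1)  # each row sums to 1
    fused = weights[:, :1] * kin_feat + weights[:, 1:] * loc_feat
    return fused, weights

rng = np.random.default_rng(0)
J, C = 21, 8  # 21 keypoints, 8 channels (arbitrary sizes for the demo)
kin = rng.normal(size=(J, C))
loc = rng.normal(size=(J, C))
w = rng.normal(size=(2 * C, 2)) * 0.1
fused, weights = adaptive_fuse(kin, loc, w)
print(fused.shape, weights.shape)  # (21, 8) (21, 2)
```

In a sketch like this, a keypoint whose local patch is occluded would ideally receive a larger weight on the kinematic branch, which is the behavior the abstract describes.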

Paper Structure (15 sections, 8 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Illustration of the HandDAGT concept. The model feeds 3D local patches, cropped from input depth images and the corresponding point clouds, to the adaptive graph transformer for keypoint coordinate estimation. During training, the local patches are perturbed with random Gaussian noise to improve robustness.
  • Figure 2: The HandDAGT architecture. HandDAGT takes a 2D depth image and the sub-sampled point cloud as input. The PointNet-based local 3D encoder and the 2D auto-encoder extract local 3D features and local 2D features, respectively. The 2D features are then projected into 3D space and fused with the 3D features to form the super point features F. From the super points, keypoint embeddings E and 3D patches are extracted as input to the novel adaptive graph transformer, which estimates accurate 3D keypoint coordinates by leveraging dynamic kinematic correspondences and local details. Notably, during the training stage, the local 3D patches are shifted by random noise to force the model to produce robust estimates.
  • Figure 3: Comparison with state-of-the-art methods on the ICVL (left) and NYU (right) datasets. The per-keypoint error (top) and success rate (bottom) are shown.
  • Figure 4: Qualitative results of HandDAGT on the ICVL (top) and NYU (bottom) datasets. Hand depth images are transformed into 3D points to clearly illustrate occlusions. Ground-truth keypoints are shown in black, results from the comparative HandR2N2 model [cheng2023handr2n2] in blue, and the keypoint coordinates estimated by our model in red. The bottom figures show self-occluded/truncated cases (left) and well-estimated cases without occlusion (right) from the NYU dataset.
  • Figure 5: Qualitative results of HandDAGT on the DexYCB dataset, including different grasping poses (top), interacting objects (2nd row), object occlusions (3rd row), and self-occlusions (bottom). Hand depth images (left) are transformed into 3D points (right) to clearly present the occlusions. Ground truth is shown in black and the keypoint coordinates estimated by our model are shown in color.
  • ...and 1 more figures
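
The denoising step described in the captions of Figures 1 and 2 (local 3D patches shifted by random Gaussian noise during training) might look roughly like the NumPy sketch below. It rests on assumptions not stated in the captions: a patch is taken to be the k nearest points around a keypoint-centered location, the noise is isotropic with a hypothetical standard deviation `sigma`, and the function names are illustrative rather than taken from the released code.

```python
import numpy as np

def extract_patches(points, centers, k):
    """Gather the k nearest points around each center.

    points: (N, 3) point cloud; centers: (J, 3) patch centers.
    Returns (J, k, 3) patches expressed relative to their center.
    """
    dists = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=-1)  # (J, N)
    idx = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest points
    return points[idx] - centers[:, None, :]

def noisy_patches(points, centers, k, sigma, rng, training=True):
    """Shift patch centers by Gaussian noise during training only."""
    if training:
        # Assumed noise model: independent zero-mean Gaussian per coordinate.
        centers = centers + rng.normal(scale=sigma, size=centers.shape)
    return extract_patches(points, centers, k), centers

rng = np.random.default_rng(0)
points = rng.uniform(-1, 1, size=(512, 3))   # toy point cloud
centers = rng.uniform(-1, 1, size=(21, 3))   # toy initial keypoint locations
patches, used = noisy_patches(points, centers, k=32, sigma=0.05, rng=rng)
print(patches.shape)  # (21, 32, 3)
```

At inference time the same function would be called with `training=False`, so the patches are cropped around the unperturbed centers.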