HandDiff: 3D Hand Pose Estimation with Diffusion on Image-Point Cloud

Wencan Cheng, Hao Tang, Luc Van Gool, Jong Hwan Ko

TL;DR

This paper tackles 3D hand pose estimation under uncertainty by treating keypoint generation as a diffusion process conditioned on depth images and hand point clouds. It introduces HandDiff, a diffusion model that combines joint-wise condition extraction with a local feature-conditioned denoiser, augmented by a kinematic correspondence-aware layer, to iteratively refine hand joints. Training applies forward diffusion and supervises the model with a smooth L1 loss; inference generates multiple pose hypotheses via DDIM and aggregates them into a single robust 3D hand pose. The approach achieves state-of-the-art results on the MSRA, ICVL, NYU, and DexYCB datasets, with improved accuracy and robustness to occlusions at near-real-time speed. For HCI and robotics, this means reliable 3D hand poses directly from depth and point-cloud input, without heavy reliance on 2D priors.
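The training recipe summarized above (forward diffusion plus a smooth L1 loss) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the linear beta schedule, the 1000-step horizon, the toy two-joint pose, and the helper names `linear_alpha_bars`, `diffuse_pose`, and `smooth_l1` are all assumptions, and the loss is applied as if the denoiser predicted the clean pose directly.

```python
import math
import random

def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products alpha_bar_t for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def diffuse_pose(pose, alpha_bar_t, rng):
    """Sample x_t ~ q(x_t | x_0): scale the clean pose and add Gaussian noise."""
    signal, noise = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [[signal * c + noise * rng.gauss(0.0, 1.0) for c in joint]
            for joint in pose]

def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss over all joint coordinates."""
    total, count = 0.0, 0
    for p_joint, t_joint in zip(pred, target):
        for p, t in zip(p_joint, t_joint):
            d = abs(p - t)
            # Quadratic near zero, linear for large errors.
            total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
            count += 1
    return total / count

# One illustrative training step on a toy 2-joint pose (hypothetical values).
rng = random.Random(0)
alpha_bars = linear_alpha_bars(num_steps=1000)
gt_pose = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]
t = rng.randrange(len(alpha_bars))
noisy_pose = diffuse_pose(gt_pose, alpha_bars[t], rng)
# A real denoiser would map (noisy_pose, t, conditions) to a pose estimate;
# here the noisy pose itself stands in for that prediction.
loss = smooth_l1(noisy_pose, gt_pose)
```

In an actual training loop the loss would be backpropagated through the denoiser; the sketch only shows how a noisy training input and its supervision signal could be formed.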

Abstract

Extracting keypoint locations from input hand frames, known as 3D hand pose estimation, is a critical task in various human-computer interaction applications. Essentially, 3D hand pose estimation can be regarded as a 3D point subset generative problem conditioned on input frames. Thanks to the recent significant progress on diffusion-based generative models, hand pose estimation can also benefit from the diffusion model to estimate keypoint locations with high quality. However, directly deploying the existing diffusion models to solve hand pose estimation is non-trivial, since they cannot achieve the complex permutation mapping and precise localization. Based on this motivation, this paper proposes HandDiff, a diffusion-based hand pose estimation model that iteratively denoises an accurate hand pose conditioned on hand-shaped image-point clouds. In order to recover keypoint permutation and accurate location, we further introduce a joint-wise condition and a local detail condition. Experimental results demonstrate that the proposed HandDiff significantly outperforms the existing approaches on four challenging hand pose benchmark datasets. Code and pre-trained models are publicly available at https://github.com/cwc1260/HandDiff.
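For orientation, the conditional denoising described in the abstract can be written in the standard diffusion notation below. The symbols are assumptions introduced here and may differ from the paper's own equations: $x_0$ is the clean pose, $x_t$ the noisy pose at step $t$, $\bar{\alpha}_t$ the cumulative noise schedule, $c$ the image-point cloud conditions, and $f_\theta$ a denoiser assumed to predict the clean pose.

```latex
% Forward (noising) process applied to the clean pose x_0:
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,I\right)

% Conditioned denoiser estimate of the clean pose (assumed parameterization):
\hat{x}_0 = f_\theta(x_t, t, c)

% Deterministic DDIM update (\eta = 0) toward the previous noise level:
x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{x}_0
        + \sqrt{1-\bar{\alpha}_{t-1}}\;
          \frac{x_t - \sqrt{\bar{\alpha}_t}\,\hat{x}_0}{\sqrt{1-\bar{\alpha}_t}}
```

Iterating the last update from pure noise down to $t = 0$ yields one pose hypothesis per starting noise sample.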

Paper Structure

This paper contains 13 sections, 10 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Illustration of the hand pose diffusion concept. The model extracts features from input depth images and corresponding point clouds as joint-wise and local conditions to guide the iterative denoising process that recovers accurate hand poses from diffused noisy pose distributions.
  • Figure 2: The pipeline of the proposed HandDiff. HandDiff takes the normalized point cloud transformed from a 2D depth image as the input. The PointNet-based local condition encoder extracts local features, aka local conditions, from input points. Then, a joint-wise condition extractor aggregates local features into latent features of each joint. Conditioned on the joint-wise conditions and local conditions sampled around each joint, the joint-wise local feature-conditioned denoiser iteratively recovers an accurate 3D hand pose by denoising the diffused noisy pose. Notably, a noiser proposed in DDIM is applied to add noise to the denoised pose for subsequent denoising steps.
  • Figure 3: Qualitative results of HandDiff on the DexYCB dataset, including different grabbing poses (top), self-occlusions (middle), and object occlusions (bottom). Hand-depth images (first rows) are transformed into 3D points (second rows) to present occlusions clearly. Ground truth is shown in black and the estimated joint coordinates of our model are shown in colors.
  • Figure 4: Comparison with the state-of-the-art methods on the ICVL (left), MSRA (middle), and NYU (right) datasets. The per-joint error (top) and success rate (bottom) are shown.
  • Figure 5: Qualitative results of HandDiff on the ICVL (left), MSRA (middle), and NYU (right) datasets. Hand-depth images are transformed into 3D points to present occlusions clearly. Ground truth is shown in black, results from the comparative HandFoldingNet (Cheng et al., 2021) are shown in orange, and the estimated joint coordinates of our model are shown in red.
  • ...and 1 more figure
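The inference loop described in the Figure 1 and Figure 2 captions (denoise, re-noise with the DDIM noiser, repeat, then combine several hypotheses) might look roughly like the sketch below. Everything specific here is an assumption made for illustration: `toy_denoiser` is a hypothetical stand-in for the joint-wise local feature-conditioned denoiser, the four-step schedule values are invented, and mean averaging is only one possible aggregation rule.

```python
import math
import random

def toy_denoiser(noisy_pose, t, condition):
    """Hypothetical stand-in for the conditioned denoiser: returns a pose estimate
    by blending the noisy pose toward the condition (a fake 'true' pose)."""
    return [[0.9 * c + 0.1 * x for x, c in zip(nj, cj)]
            for nj, cj in zip(noisy_pose, condition)]

def ddim_step(x_t, x0_hat, ab_t, ab_prev):
    """Deterministic DDIM update (eta = 0): re-noise x0_hat to the previous level."""
    out = []
    for xj, pj in zip(x_t, x0_hat):
        joint = []
        for x, p in zip(xj, pj):
            # Noise implied by the current sample and the pose estimate.
            eps = (x - math.sqrt(ab_t) * p) / math.sqrt(1.0 - ab_t)
            joint.append(math.sqrt(ab_prev) * p + math.sqrt(1.0 - ab_prev) * eps)
        out.append(joint)
    return out

def sample_pose(condition, alpha_bars, rng):
    """Run one reverse trajectory from Gaussian noise to a pose hypothesis."""
    x = [[rng.gauss(0.0, 1.0) for _ in joint] for joint in condition]
    for i in range(len(alpha_bars) - 1, -1, -1):
        x0_hat = toy_denoiser(x, i, condition)
        if i == 0:
            return x0_hat  # last step: keep the denoised pose as the hypothesis
        x = ddim_step(x, x0_hat, alpha_bars[i], alpha_bars[i - 1])

def aggregate(hypotheses):
    """Average hypotheses joint by joint (mean is an assumed aggregation rule)."""
    n = len(hypotheses)
    return [[sum(h[j][k] for h in hypotheses) / n
             for k in range(len(hypotheses[0][j]))]
            for j in range(len(hypotheses[0]))]

rng = random.Random(0)
alpha_bars = [0.9, 0.6, 0.3, 0.05]  # short schedule, noisier at higher index (assumed)
condition = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]  # toy 2-joint example
hypotheses = [sample_pose(condition, alpha_bars, rng) for _ in range(5)]
final_pose = aggregate(hypotheses)
```

Drawing each hypothesis from a different noise sample is what lets the aggregated pose reflect the uncertainty of occluded joints; the number of hypotheses and denoising steps trades accuracy against runtime.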