ReJSHand: Efficient Real-Time Hand Pose Estimation and Mesh Reconstruction Using Refined Joint and Skeleton Features

Shan An, Shipeng Dai, Mahrukh Ansari, Yu Liang, Ming Zeng, Konstantinos A. Tsintotas, Changhong Fu, Hong Zhang

TL;DR

ReJSHand tackles real-time monocular hand pose estimation and mesh reconstruction with a lightweight architecture that maps 2D image features to 2D keypoints, 3D keypoints, and a hand mesh. The method combines an expansion block for high-resolution feature upsampling, a feature interaction block built on coordinate attention and multi-head self-attention, and a MANO-based joint matrix to produce accurate 3D joints and detailed mesh vertices. Training optimizes weighted losses on 2D keypoints, 3D keypoints, and mesh vertices. The model achieves a PA-MPJPE of 6.3 mm and a PA-MPVPE of 6.4 mm while running at 72 FPS on an NVIDIA 2080Ti GPU with only 1.91M parameters. Evaluations on the FreiHand dataset show strong accuracy at real-time speed, making ReJSHand well suited to dynamic robotic manipulation and interactive applications. The work highlights a practical, efficient pathway for high-fidelity hand pose and mesh estimation in real time.
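
As a rough illustration of how the three loss terms might be combined, the sketch below sums a mean absolute error over 2D keypoints, 3D keypoints, and mesh vertices, with one weight per term. The L1 form, the dictionary keys, and the weight values are assumptions made for illustration only; this summary does not state the paper's loss definitions or weights.

```python
import numpy as np


def l1(pred, target):
    """Mean absolute error between two arrays of the same shape."""
    return float(np.mean(np.abs(pred - target)))


def total_loss(pred, target, weights):
    """Weighted sum of per-output L1 losses.

    `pred` and `target` are dicts keyed by output name; `weights` maps the
    same names to scalar weights. All names here are hypothetical.
    """
    return sum(w * l1(pred[name], target[name]) for name, w in weights.items())


# Toy shapes: 21 keypoints and 778 mesh vertices, as in the MANO hand model.
shapes = {"kp2d": (21, 2), "kp3d": (21, 3), "verts": (778, 3)}
pred = {name: np.zeros(shape) for name, shape in shapes.items()}
target = {name: np.ones(shape) for name, shape in shapes.items()}

# Placeholder weights, not the values used in the paper.
weights = {"kp2d": 1.0, "kp3d": 2.0, "verts": 3.0}
loss = total_loss(pred, target, weights)
```

Each term here has an error of exactly 1, so the total equals the sum of the weights; in practice the weights balance terms that live on different scales (pixels for 2D keypoints, metric units for 3D keypoints and vertices).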

Abstract

Accurate hand pose estimation is vital in robotics, where it supports dexterous manipulation and human-computer interaction. Toward this goal, this paper presents ReJSHand (which stands for Refined Joint and Skeleton Features), a network designed for real-time hand pose estimation and mesh reconstruction. The proposed framework is designed to accurately predict 3D hand gestures under real-time constraints, which is essential for systems that demand agile and responsive hand motion tracking. The network's design prioritizes computational efficiency without compromising accuracy, a prerequisite for instantaneous robotic interactions. Specifically, ReJSHand comprises a 2D keypoint generator, a 3D keypoint generator, an expansion block, and a feature interaction block for reconstructing 3D hand poses from 2D imagery. In addition, a multi-head self-attention mechanism and a coordinate attention layer enhance feature representation, streamlining the creation of hand mesh vertices through feature mapping and linear transformation. Regarding performance, comprehensive evaluations on the FreiHand dataset demonstrate ReJSHand's efficiency. It achieves a frame rate of 72 frames per second while maintaining a PA-MPJPE (Procrustes-Aligned Mean Per Joint Position Error) of 6.3 mm and a PA-MPVPE (Procrustes-Aligned Mean Per Vertex Position Error) of 6.4 mm. Moreover, our model reaches scores of 0.756 for F@05 and 0.984 for F@15, surpassing modern pipelines and solidifying its position at the forefront of robotic hand pose estimators. To facilitate future studies, we provide our source code at https://github.com/daishipeng/ReJSHand.
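
For readers unfamiliar with the PA-MPJPE metric, it is conventionally computed by first aligning the predicted joints to the ground truth with a similarity transform (scale, rotation, translation) and then averaging the per-joint Euclidean distances. The following is a minimal NumPy sketch of that convention, not the evaluation code used by the authors; the function names are chosen here for illustration.

```python
import numpy as np


def procrustes_align(pred, gt):
    """Align `pred` (N, 3) to `gt` (N, 3) with the best similarity transform."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g

    # Kabsch/Umeyama: SVD of the 3x3 cross-covariance matrix.
    U, S, Vt = np.linalg.svd(p.T @ g)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    D = np.diag([1.0, 1.0, d])

    R = U @ D @ Vt                                   # acts on row vectors: p @ R
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()  # optimal isotropic scale
    return scale * p @ R + mu_g


def pa_mpjpe(pred, gt):
    """Mean per-point distance after Procrustes alignment (same units as input)."""
    aligned = procrustes_align(pred, gt)
    return float(np.linalg.norm(aligned - gt, axis=1).mean())


# Toy check: a scaled, rotated, and shifted copy of the ground truth
# should align back with near-zero error.
rng = np.random.default_rng(0)
gt = rng.normal(size=(21, 3))
c, s = np.cos(0.3), np.sin(0.3)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
pred = 2.0 * gt @ Rz + np.array([10.0, -5.0, 3.0])
err = pa_mpjpe(pred, gt)
```

Passing mesh vertices instead of joints to the same function gives the corresponding per-vertex metric (PA-MPVPE).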

Paper Structure

This paper contains 17 sections, 5 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: We compare hand pose estimators in terms of their accuracy and computational efficiency. ReJSHand achieves the best balance between accuracy and speed. All tests were conducted on an NVIDIA 2080Ti GPU.
  • Figure 2: An overview of ReJSHand, the proposed lightweight network for real-time hand pose estimation and mesh reconstruction. First, the cropped hand images are processed through the backbone network to extract features. Next, the 2D keypoint generator maps these features to 2D coordinates. Simultaneously, the expansion block upsamples the feature map using transposed convolutional layers and sampling techniques. By jointly mapping both features, we leverage their complementary and synergistic roles in our hand pose estimator. The feature interaction block then refines the joint and skeleton features by learning coordinate dependencies through coordinate and multi-head attention modules. Subsequently, the mesh token generator integrates these refined features to generate mesh vertices. Finally, the 3D keypoint generator maps the mesh vertices to 3D keypoint coordinates by integrating the joint matrix.
  • Figure 3: Qualitative comparison of the hand meshes produced by ReJSHand and other state-of-the-art methods. Results show that our approach achieves more accurate hand mesh reconstruction outcomes that are closer to the ground truth.
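
The last step in the Figure 2 pipeline, mapping mesh vertices to 3D keypoints through a joint matrix, amounts to a linear map from vertices to joints. The sketch below illustrates the idea under stated assumptions: a 778-vertex mesh and 21 keypoints (the usual MANO and FreiHand sizes), and a random row-normalized matrix standing in for the real joint matrix, whose values are not given in this summary.

```python
import numpy as np

NUM_VERTS, NUM_JOINTS = 778, 21  # typical MANO mesh / FreiHand keypoint counts


def regress_joints(vertices, joint_matrix):
    """Map mesh vertices (..., V, 3) to joints (..., J, 3) with a (J, V) matrix."""
    return joint_matrix @ vertices


rng = np.random.default_rng(0)

# Placeholder joint matrix: each row is a set of non-negative weights that
# sum to 1, so every joint is a weighted average of vertex positions.
J = rng.random((NUM_JOINTS, NUM_VERTS))
J /= J.sum(axis=1, keepdims=True)

verts = rng.normal(size=(NUM_VERTS, 3))   # stand-in for predicted mesh vertices
joints = regress_joints(verts, J)         # (21, 3) 3D keypoints
```

Because each row of the placeholder matrix sums to 1, translating the mesh translates the regressed joints by the same offset, which is the behavior one would expect from a vertex-to-joint regressor.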