A Simple Baseline for Efficient Hand Mesh Reconstruction

Zhishan Zhou, Shihao Zhou, Zhi Lv, Minqiang Zou, Yao Tang, Jiajun Liang

TL;DR

This paper decomposes the mesh decoder into a token generator and a mesh regressor. It finds that the token generator should select discriminative and representative points, while the mesh regressor should upsample sparse keypoints into a dense mesh over multiple stages.

Abstract

3D hand pose estimation has found broad application in areas such as gesture recognition and human-machine interaction. As performance improves, system complexity also increases, which can hinder both comparative analysis and practical deployment of these methods. In this paper, we propose a simple yet effective baseline that not only surpasses state-of-the-art (SOTA) methods but is also computationally efficient. To establish this baseline, we abstract existing work into two components, a token generator and a mesh regressor, and then examine their core structures. A core structure, in this context, is one that fulfills intrinsic functions, brings significant improvements, and achieves excellent performance without unnecessary complexity. Our approach requires no modification to the backbone, making it adaptable to any modern model. Our method outperforms existing solutions, achieving SOTA results across multiple datasets. On the FreiHAND dataset, it achieves a PA-MPJPE of 5.7 mm and a PA-MPVPE of 6.0 mm. Similarly, on the DexYCB dataset, it achieves a PA-MPJPE of 5.5 mm and a PA-MPVPE of 5.0 mm. In terms of speed, our method reaches up to 33 frames per second (fps) with HRNet and up to 70 fps with FastViT-MA36.
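
The PA-MPJPE and PA-MPVPE figures quoted above are Procrustes-aligned errors: the prediction is first aligned to the ground truth with the best-fitting similarity transform (rotation, translation, uniform scale), and the mean Euclidean distance is then taken over joints or over mesh vertices. The following is a minimal NumPy sketch of that standard metric, not the authors' evaluation code; the function names `procrustes_align` and `pa_mpjpe` are illustrative, and inputs are assumed to be `(N, 3)` arrays in millimetres.

```python
import numpy as np


def procrustes_align(pred, gt):
    """Align pred (N, 3) to gt (N, 3) with the least-squares similarity transform."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mu_pred, mu_gt = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_pred, gt - mu_gt

    # Rotation from the SVD of the 3x3 cross-covariance (Kabsch / Umeyama).
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    flip = np.array([1.0, 1.0, d])
    rot = vt.T @ np.diag(flip) @ u.T

    # Optimal uniform scale for the centred prediction.
    scale = (s * flip).sum() / (p ** 2).sum()
    return scale * p @ rot.T + mu_gt


def pa_mpjpe(pred, gt):
    """Mean per-point Euclidean error after Procrustes alignment (same unit as input)."""
    aligned = procrustes_align(pred, gt)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(aligned - gt, axis=1).mean())
```

Passing joint coordinates gives PA-MPJPE and passing mesh vertices gives PA-MPVPE. Because of the alignment step, a prediction that differs from the ground truth only by a global rotation, translation, or scale scores an error of approximately zero.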

Paper Structure (16 sections, 7 equations, 8 figures, 8 tables)


Figures (8)

  • Figure 1: Trade-off between accuracy and inference speed. Our method surpasses non-real-time methods ($\leq$ 40 fps) in both speed and accuracy. Compared with real-time methods ($\geq$ 70 fps), it offers a substantial gain in accuracy at comparable speed. For a fair comparison, all speeds were measured on a 2080 Ti GPU with a batch size of one.
  • Figure 2: Token generator designs. Grid cells colored red mark the sampled points. a) Global feature; b) Grid sampling; c) Keypoint-guided sampling on the original feature map; d) Keypoint-guided sampling with 4x upsampling, which yields an enhanced feature; e) Keypoint-guided sampling with 4x upsampling, where the feature is further refined by convolution; f) Coarse-mesh-guided point sampling with 4x upsampling.
  • Figure 3: Architecture of a decoder layer in the mesh regressor. It consists of a dimension-reduction layer, a MetaFormer block, and an upsampling layer, connected in sequence.
  • Figure 4: Overview of our architecture. First, a backbone network extracts the image feature $X_b$. This feature is passed to the token generator, which predicts 2D keypoints and samples points on the upsampled feature map to produce joint tokens. The joint tokens are then fed to the mesh regressor, which predicts the final mesh coordinates.
  • Figure 5: Qualitative comparison between our method and other state-of-the-art approaches.
  • ...and 3 more figures
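
The keypoint-guided sampling described in the Figure 2 and Figure 4 captions can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: `upsample_nearest` is a nearest-neighbour stand-in for whatever upsampling the model actually uses, `sample_tokens` is a hypothetical helper that reads one feature vector per keypoint by bilinear interpolation, and keypoints are assumed to be given as `(x, y)` pixel coordinates on the upsampled map.

```python
import numpy as np


def upsample_nearest(feat, factor=4):
    """Nearest-neighbour upsampling of a (C, H, W) map (stand-in for a learned upsampler)."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)


def sample_tokens(feat, keypoints_xy):
    """Bilinearly sample feat (C, H, W) at K (x, y) points; returns (K, C) tokens."""
    _, h, w = feat.shape
    pts = np.asarray(keypoints_xy, dtype=float)
    # Clamp points to the valid pixel range.
    x = np.clip(pts[:, 0], 0, w - 1)
    y = np.clip(pts[:, 1], 0, h - 1)

    # Integer corners surrounding each point.
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0

    # Interpolate along x on the two rows, then along y; arrays are (C, K).
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return (top * (1 - wy) + bottom * wy).T


# Hypothetical sizes: an 8-channel 7x7 backbone feature and 21 predicted keypoints.
backbone_feat = np.random.default_rng(0).normal(size=(8, 7, 7))
upsampled = upsample_nearest(backbone_feat, factor=4)            # (8, 28, 28)
keypoints = np.random.default_rng(1).uniform(0, 27, size=(21, 2))
joint_tokens = sample_tokens(upsampled, keypoints)               # one token per keypoint
```

Sampling on the upsampled map gives each keypoint a finer-grained location to read from than the coarse backbone grid, which is the motivation the captions give for variants d) to f).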