Fast-HaMeR: Boosting Hand Mesh Reconstruction using Knowledge Distillation

Hunain Ahmed Jillani, Ahmed Tawfik Aboukhadra, Ahmed Elhayek, Jameel Malik, Nadia Robertini, Didier Stricker

Abstract

Fast and accurate 3D hand reconstruction is essential for real-time applications in VR/AR, human-computer interaction, robotics, and healthcare. Most state-of-the-art methods rely on heavy models, which limits their use on resource-constrained devices such as headsets, smartphones, and embedded systems. In this paper, we investigate how lightweight neural networks, combined with Knowledge Distillation, can make complex 3D hand reconstruction models faster and lighter while maintaining comparable reconstruction accuracy. Although our approach is suited to various hand reconstruction frameworks, we focus primarily on boosting the HaMeR model, currently the leading method in terms of reconstruction accuracy. We replace its original ViT-H backbone with lighter alternatives, including MobileNet, MobileViT, ConvNeXt, and ResNet, and evaluate three knowledge distillation strategies: output-level, feature-level, and a hybrid of both. Our experiments show that lightweight backbones only 35% of the size of the original achieve 1.5x faster inference while preserving similar quality, with an accuracy difference of only 0.4 mm. More specifically, we show that output-level distillation notably improves student performance, while feature-level distillation proves more effective for higher-capacity students. Overall, the findings pave the way for efficient real-world applications on low-power devices. The code and models are publicly available at https://github.com/hunainahmedj/Fast-HaMeR.

Paper Structure (19 sections, 4 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Real-time monocular 3D hand mesh reconstruction using lightweight student networks. We present a Knowledge Distillation approach for training lightweight networks that accelerate 3D hand reconstruction without compromising quality. Our best-performing network achieves real-time inference with only a modest drop in reconstruction quality. The figure shows qualitative results across challenging scenarios, including mutual occlusion, complex hand poses, and interactions with various objects.
  • Figure 2: High-level overview of the teacher-student architecture. Only the distillation losses relevant to the chosen KD level are used. $\phi(F_T)$ denotes a 1x1 convolution that projects the teacher's features to match the dimensions of the student's features for feature-level distillation. The teacher network is used only during training; at inference time, only the trained student network is used.
  • Figure 3: Trade-off between model accuracy (PA-MPJPE), speed (FPS), and parameter size (circle size). HaMeR is the most accurate model, but several alternatives come close while using fewer resources and running faster. In our experiments, ConvNeXt-L with feature distillation ranks just behind HaMeR, with a 1.48x FPS boost and a 64% reduction in size.
  • Figure 4: Qualitative results on images from the internet showing hands interacting with the environment. We compare the results of our best configuration (ConvNeXt-L with feature-level distillation) with HaMeR in both 2D and 3D.
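
The three distillation levels described in the abstract and in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the choice of mean-squared error for features and mean absolute error for outputs, the loss weights, and the tensor shapes are all assumptions made for illustration. Only the idea of a 1x1 convolution $\phi(F_T)$ that projects teacher features to the student's dimensions comes from the caption. The sketch also assumes that teacher and student feature maps share the same spatial size.

```python
import numpy as np


def project_1x1(features, weight, bias):
    """1x1 convolution phi(.): a per-pixel linear map over channels.

    features: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)
    """
    return np.einsum("oc,chw->ohw", weight, features) + bias[:, None, None]


def feature_loss(student_feat, teacher_feat, weight, bias):
    """Feature-level term: MSE between F_S and phi(F_T) (MSE is an assumption)."""
    projected = project_1x1(teacher_feat, weight, bias)
    return float(np.mean((student_feat - projected) ** 2))


def output_loss(student_out, teacher_out):
    """Output-level term: mean absolute error between predictions (assumption)."""
    return float(np.mean(np.abs(student_out - teacher_out)))


def hybrid_loss(student_out, teacher_out, student_feat, teacher_feat,
                weight, bias, w_out=1.0, w_feat=1.0):
    """Hybrid term: weighted sum of both levels (weights are hypothetical)."""
    return (w_out * output_loss(student_out, teacher_out)
            + w_feat * feature_loss(student_feat, teacher_feat, weight, bias))


# Toy shapes: a teacher with 4 channels, a student with 2 channels, 2x2 maps.
teacher_feat = np.ones((4, 2, 2))
student_feat = np.ones((2, 2, 2))
weight = np.full((2, 4), 0.25)   # in training these would be learned
bias = np.zeros(2)

student_out = np.array([1.0, 2.0])
teacher_out = np.array([1.0, 4.0])

total = hybrid_loss(student_out, teacher_out, student_feat, teacher_feat,
                    weight, bias, w_out=1.0, w_feat=0.5)
print(total)
```

Setting `w_feat=0.0` gives pure output-level distillation and setting `w_out=0.0` gives pure feature-level distillation, which corresponds to using only the losses relevant to the chosen KD level. In a training setup the projection weights would be optimized jointly with the student, while the teacher stays fixed and is dropped at inference time.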