VM-BHINet: Vision Mamba Bimanual Hand Interaction Network for 3D Interacting Hand Mesh Recovery From a Single RGB Image

Han Bi, Ge Yu, Yu He, Wenzhuo Liu, Zijie Zheng

TL;DR

The paper tackles robust 3D reconstruction of bimanual hand interactions from monocular RGB by introducing VM-BHINet, which integrates State Space Models (SSMs) into a Vision Mamba framework to capture dynamic inter-hand dependencies. The architecture comprises a ResNet-50 backbone, VM-IFEBlock for inter-hand feature fusion, Hand Joint Feature Extractor, Joint Vision Mamba Block, and a Dual Hand Parameter Regressor that outputs MANO pose/shape and 3D translation. Key contributions include the first use of SSMs in 3D interacting hand reconstruction and a demonstrated 2–3% reduction in MPJPE/MPVPE on InterHand2.6M, along with competitive results on HIC and significantly fewer parameters and FLOPs. The approach offers a practical, accurate, and efficient solution for real-time 3D hand mesh recovery in challenging bimanual scenes, with potential impact on AR/VR and HCI applications.
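As background for the SSM terminology used above, the following is a minimal sketch of a discrete linear state space recurrence, h_t = A h_{t-1} + B x_t with output y_t = C h_t. It is not the paper's VM-IFEBlock: the function name `ssm_scan`, the fixed matrices, and the scalar input sequence are all assumptions chosen for illustration, and Mamba-style models additionally make the parameters depend on the input.

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Run a discrete linear SSM over a 1D input sequence.

    A: (N, N) state transition, B: (N,) input map, C: (N,) output map,
    x: (T,) scalar inputs. All parameters are fixed here (hypothetical
    setup); selective SSMs such as Mamba compute them from the input.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    C = np.asarray(C, dtype=float)
    h = np.zeros(A.shape[0])          # hidden state starts at zero
    ys = []
    for x_t in np.asarray(x, dtype=float):
        h = A @ h + B * x_t           # state update
        ys.append(float(C @ h))       # readout
    return np.array(ys)

# A one-dimensional state that halves at every step: an impulse decays geometrically.
y = ssm_scan(A=[[0.5]], B=[1.0], C=[1.0], x=[1.0, 0.0, 0.0])
```

The loop touches each element once, so the cost grows linearly with sequence length, which is the efficiency property usually cited as motivation for SSM-based blocks.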

Abstract

Understanding bimanual hand interactions is essential for realistic 3D pose and shape reconstruction. However, existing methods struggle with occlusions, ambiguous appearances, and computational inefficiencies. To address these challenges, we propose Vision Mamba Bimanual Hand Interaction Network (VM-BHINet), introducing state space models (SSMs) into hand reconstruction to enhance interaction modeling while improving computational efficiency. The core component, Vision Mamba Interaction Feature Extraction Block (VM-IFEBlock), combines SSMs with local and global feature operations, enabling deep understanding of hand interactions. Experiments on the InterHand2.6M dataset show that VM-BHINet reduces mean per-joint position error (MPJPE) and mean per-vertex position error (MPVPE) by 2–3%, significantly surpassing state-of-the-art methods.
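For readers unfamiliar with the two metrics, the sketch below shows how a mean per-point position error can be computed: the same function serves as MPJPE when given joints and as MPVPE when given mesh vertices. The function names and the numbers in the usage lines are hypothetical and are not results from the paper. Evaluation protocols commonly align prediction and ground truth at a root joint first; that step is omitted here and assumed to be done by the caller.

```python
import numpy as np

def mean_per_point_error(pred, gt):
    """Mean Euclidean distance between corresponding 3D points.

    pred, gt: arrays of shape (..., 3), e.g. (21, 3) joints for MPJPE
    or (778, 3) MANO vertices for MPVPE, in the same units.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape or pred.shape[-1] != 3:
        raise ValueError("pred and gt must share a shape ending in 3")
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def relative_reduction(baseline_error, new_error):
    """Fractional error reduction relative to a baseline (0.025 means 2.5%)."""
    return (baseline_error - new_error) / baseline_error

# Hypothetical example: every joint is off by a (3, 4, 0) offset, so the error is 5.
gt_joints = np.zeros((21, 3))
pred_joints = gt_joints + np.array([3.0, 4.0, 0.0])
mpjpe = mean_per_point_error(pred_joints, gt_joints)
```

Read this way, a "2–3% reduction" is a relative change in the averaged distance, not an absolute change of 2–3 units.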

Paper Structure

This paper contains 18 sections, 11 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Visual results of the VM-BHINet. VM-BHINet achieves remarkable visual performance in various hand poses.
  • Figure 2: Our proposed Vision Mamba Bimanual Hand Interaction Network (VM-BHINet). It consists of five main components: the Backbone, the Vision Mamba Interaction Feature Extraction Block (VM-IFEBlock), Hand Joint Feature Extractor (HJFE), Joint Vision Mamba Block (JVMBlock), and the Dual Hand Parameter Regressor (DHPR).
  • Figure 3: The illustration of the proposed VMblock.
  • Figure 4: Qualitative ablation study on the InterHand2.6M [moon2020interhand2] dataset. The results show that our full model achieves the best performance compared to the versions that exclude certain components, where 'w/o' stands for 'without'.
  • Figure 5: Qualitative comparison of interacting hand reconstruction between our method and the state-of-the-art methods IntagHand [li2022interacting] and EANet [park2023extract] on the InterHand2.6M [moon2020interhand2] dataset. Our approach demonstrates superior reconstruction quality across various viewpoints and different levels of inter-hand occlusion.