VM-BHINet: Vision Mamba Bimanual Hand Interaction Network for 3D Interacting Hand Mesh Recovery From a Single RGB Image
Han Bi, Ge Yu, Yu He, Wenzhuo Liu, Zijie Zheng
TL;DR
The paper tackles robust 3D reconstruction of bimanual hand interactions from a single monocular RGB image. It introduces VM-BHINet, which integrates State Space Models (SSMs) into a Vision Mamba framework to capture dynamic inter-hand dependencies. The architecture comprises a ResNet-50 backbone, a VM-IFEBlock for inter-hand feature fusion, a Hand Joint Feature Extractor, a Joint Vision Mamba Block, and a Dual Hand Parameter Regressor that outputs MANO pose and shape parameters plus 3D translation. Key contributions include the first use of SSMs in 3D interacting hand reconstruction and a 2–3% reduction in MPJPE and MPVPE on InterHand2.6M, along with competitive results on HIC and significantly fewer parameters and FLOPs. The approach offers an accurate and efficient option for real-time 3D hand mesh recovery in challenging bimanual scenes, with potential impact on AR/VR and HCI applications.
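For readers unfamiliar with SSMs, the recurrence they build on can be sketched in a few lines. This is a plain, non-selective linear SSM with a diagonal state matrix, not the VM-IFEBlock or Mamba's input-dependent variant; the function name `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical and chosen only for illustration.

```python
def ssm_scan(xs, a, b, c):
    """Scan a scalar input sequence through a diagonal linear SSM.

    Per step: h_t = a * h_{t-1} + b * x_t  (elementwise over state dims)
              y_t = sum(c * h_t)
    """
    h = [0.0] * len(a)  # hidden state starts at zero
    ys = []
    for x in xs:
        # State update: decay the previous state, then inject the input.
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        # Readout: linear projection of the state to a scalar output.
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


# Impulse input: the output traces the decaying memory of the state.
ys = ssm_scan([1.0, 0.0, 0.0], a=[0.5, 0.25], b=[1.0, 1.0], c=[1.0, 1.0])
print(ys)
```

The cost grows linearly with sequence length, which is the usual argument for SSMs over quadratic attention when efficiency matters.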
Abstract
Understanding bimanual hand interactions is essential for realistic 3D pose and shape reconstruction. However, existing methods struggle with occlusions, ambiguous appearances, and computational inefficiencies. To address these challenges, we propose the Vision Mamba Bimanual Hand Interaction Network (VM-BHINet), which introduces state space models (SSMs) into hand reconstruction to enhance interaction modeling while improving computational efficiency. The core component, the Vision Mamba Interaction Feature Extraction Block (VM-IFEBlock), combines SSMs with local and global feature operations, enabling a deep understanding of hand interactions. Experiments on the InterHand2.6M dataset show that VM-BHINet reduces mean per-joint position error (MPJPE) and mean per-vertex position error (MPVPE) by 2–3%, significantly surpassing state-of-the-art methods.
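As a reference for the two metrics named above, the following minimal sketch computes a mean per-point Euclidean error, which is MPJPE when the points are joints and MPVPE when they are mesh vertices. The helper names and the toy coordinates are hypothetical. Any alignment step applied before measuring error (for example, root-relative alignment) is omitted here because this section does not specify the evaluation protocol. `relative_reduction` shows one way to express an improvement as a percentage; whether the reported 2–3% is relative or absolute is not stated in this section.

```python
import math


def mean_per_point_error(pred, gt):
    """Mean Euclidean distance between corresponding 3D points."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Toy example with two joints (units are arbitrary, e.g. millimetres).
pred_joints = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt_joints = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
mpjpe = mean_per_point_error(pred_joints, gt_joints)
print(mpjpe)
```

Passing mesh vertices instead of joints to the same function gives the per-vertex variant.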
