MS-MANO: Enabling Hand Pose Tracking with Biomechanical Constraints

Pengfei Xie, Wenqiang Xu, Tutian Tang, Zhenjun Yu, Cewu Lu

TL;DR

The paper addresses the mismatch between learned hand dynamics and physiological realism by introducing MS-MANO, a musculoskeletal extension of MANO that enforces biomechanical constraints through Hill-type muscles. It couples this model with BioPR, a simulation-in-the-loop pose refinement framework: IDNet infers muscle excitations, a forward simulator generates a biomechanically plausible reference pose, and an MLP uses that reference to refine the initial estimate. The approach shows improved anatomical plausibility and quantitative gains over baselines on DexYCB and OakInk, with BioPR providing consistent improvements at a small runtime overhead. By integrating biomechanics into learnable hand models, the work targets more human-like motion under occlusion and temporal perturbations, with potential applications in animation, robotics, and AR/VR.

Abstract

This work proposes a novel learning framework for visual hand dynamics analysis that takes into account the physiological aspects of hand motion. Existing models, which are simplified joint-actuated systems, often produce unnatural motions. To address this, we integrate a musculoskeletal system with a learnable parametric hand model, MANO, to create a new model, MS-MANO. This model emulates the dynamics of muscles and tendons to drive the skeletal system, imposing physiologically realistic constraints on the resulting torque trajectories. We further propose a simulation-in-the-loop pose refinement framework, BioPR, that refines the initial estimated pose through a multi-layer perceptron (MLP) network. We evaluate the accuracy of MS-MANO and the efficacy of BioPR separately. The accuracy of MS-MANO is compared with MyoSuite, while the efficacy of BioPR is benchmarked on two large-scale public datasets with two recent state-of-the-art methods as baselines. The results demonstrate that our approach consistently improves the baseline methods both quantitatively and qualitatively.

Paper Structure (33 sections, 9 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: The physiological mechanism of hand dynamics. (a) The excitation signal originating from the brain triggers the contraction or relaxation of muscles. The triggered muscle segments are illustrated in green, while the relaxed ones are in brown. (b) The muscle contraction triggered by excitation manifests as visible movement of the hand.
  • Figure 2: The Hill-type muscle. (a) Each muscle segment is composed of the contractile element CE, the parallel elastic element PEE, and the serial elastic element SEE. (b) Each muscle segment originates from a certain point $n_\text{origin}$ and ends at $n_\text{insertion}$. A joint $j$ connects two bones. Once triggered, the muscle segment can apply torque $\bm\tau$ on the joint.
  • Figure 3: Joint-centric muscle adaptation. (a) A set of smaller bones in the MyoHand model is mapped into a single joint in the MANO model. (b) The bone-centric muscle segments can adapt to different shapes. (c) (Left) The raw skeleton after the automatic mapping results in issues such as intersections. (Right) The manually revised skeleton fits the MANO model.
  • Figure 4: The simulation-in-the-loop pipeline of BioPR. Given a sequence of RGB images and the corresponding predictions of an existing hand pose estimator, BioPR first interpolates and differentiates the poses to get the joint velocities. Then, the IDNet is used to infer the muscle excitation signals. The joint poses, velocities, excitation signals, and the poses of the previous frame (denoted by dotted lines) are sent into the simulator, which generates the next reference pose by forward kinematics. The Refine Net then performs the final refinement based on the pose, velocity, and reference pose. On the next frame, the refined pose can be fed back to the simulator.
  • Figure 5: Qualitative results on DexYCB. Left: When a person is forcefully grasping a mustard bottle, the tightness of the middle, ring, and little fingers differs between gSDF and our method. The projected results of our method better align with the input image. Middle: The thumb posture predicted by gSDF exhibits some odd distortion, which is not observed in our approach. Right: Under severe occlusion, gSDF may generate hand poses that penetrate the object. Our method mitigates this problem by capturing the dynamics of the musculoskeletal system.
  • ...and 1 more figure
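
The data flow described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`finite_difference_velocities`, `biopr_loop`, `toy_idnet`, `toy_simulate`, `toy_refine`), the frame indexing, and the array shapes are all hypothetical, and the three `toy_*` functions are trivial placeholders standing in for the learned IDNet, the musculoskeletal simulator, and the Refine Net. In particular, the placeholder simulator ignores the excitation signal and takes a plain Euler step, whereas the paper's simulator is driven by Hill-type muscles.

```python
import numpy as np


def finite_difference_velocities(poses, dt):
    """Approximate joint velocities from a (T, D) pose sequence.

    Uses central differences in the interior and one-sided
    differences at the two ends (numpy.gradient defaults).
    """
    return np.gradient(poses, dt, axis=0)


def biopr_loop(poses, dt, idnet, simulate, refine):
    """Hypothetical per-frame refinement loop following the caption.

    poses    : (T, D) initial pose estimates from an existing estimator
    idnet    : (pose, velocity) -> excitation            (assumed signature)
    simulate : (pose, velocity, excitation, prev_pose, dt) -> reference pose
    refine   : (pose, velocity, reference) -> refined pose
    """
    poses = np.asarray(poses, dtype=float)
    velocities = finite_difference_velocities(poses, dt)
    refined = np.empty_like(poses)
    refined[0] = poses[0]          # no previous frame to simulate from
    prev = refined[0]
    for t in range(1, len(poses)):
        excitation = idnet(poses[t], velocities[t])
        reference = simulate(poses[t], velocities[t], excitation, prev, dt)
        refined[t] = refine(poses[t], velocities[t], reference)
        prev = refined[t]          # refined pose is fed back on the next frame
    return refined


# --- Placeholder components (NOT the paper's networks or simulator) ---

def toy_idnet(pose, velocity):
    # Placeholder: excitations squashed into [0, 1].
    return np.clip(np.abs(velocity), 0.0, 1.0)


def toy_simulate(pose, velocity, excitation, prev_pose, dt):
    # Placeholder: Euler step from the previous refined pose;
    # the excitation is deliberately unused here.
    return prev_pose + velocity * dt


def toy_refine(pose, velocity, reference):
    # Placeholder: average the initial estimate and the reference pose.
    return 0.5 * (pose + reference)


dt = 1.0 / 30.0
frames = np.arange(5)[:, None] * dt * np.array([[0.3, -0.2]])
out = biopr_loop(frames, dt, toy_idnet, toy_simulate, toy_refine)
```

With these placeholders, a constant-velocity input sequence passes through unchanged, which is a quick sanity check of the wiring rather than evidence about the method itself. Swapping in real components would only require callables with the same (assumed) signatures.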