Two Hands Are Better Than One: Resolving Hand to Hand Intersections via Occupancy Networks
Maksym Ivashechkin, Oscar Mendez, Richard Bowden
TL;DR
This work tackles two-hand 3D pose estimation under mutual occlusion by introducing a continuous occupancy network that represents hand volume and a differentiable intersection loss that discourages hand-to-hand intersections. It couples these with a new watertight hand mesh parameterization that reduces mesh complexity and enables reliable inside/outside tests via ray casting. On InterHand2.6M, the approach reduces hand intersections beyond the state of the art while maintaining or improving MPJPE, and it also suppresses intersections on the Re:InterHand and SMILE datasets and in in-the-wild scenarios. The framework is modular and can enhance existing single-hand estimators, which benefits fine-grained hand interactions in sign language and other two-hand tasks.
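To illustrate why watertightness matters for an inside/outside test, the sketch below implements a generic parity test by ray casting: a point is inside a closed triangle mesh if a ray from it crosses the surface an odd number of times. This is not the authors' implementation. It uses the standard Möller-Trumbore ray-triangle test, and the names `ray_hits_triangle` and `is_inside` and the fixed ray direction are assumptions made for this example.

```python
import numpy as np

def ray_hits_triangle(origin, direction, tri, eps=1e-9):
    """Moller-Trumbore test: True if the ray origin + t*direction (t > 0) crosses tri."""
    v0, v1, v2 = tri
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = np.dot(e1, p)
    if abs(det) < eps:              # ray is parallel to the triangle plane
        return False
    inv = 1.0 / det
    s = origin - v0
    u = np.dot(s, p) * inv          # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return False
    q = np.cross(s, e1)
    v = np.dot(direction, q) * inv  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return False
    t = np.dot(e2, q) * inv         # distance along the ray
    return bool(t > eps)

def is_inside(point, vertices, faces, direction=(0.577, 0.211, 0.789)):
    """Parity test: an odd number of crossings means the point is inside.

    Only meaningful for a watertight mesh. The irregular default direction
    (an arbitrary choice for this sketch) makes hits exactly on edges or
    vertices unlikely.
    """
    origin = np.asarray(point, dtype=float)
    d = np.asarray(direction, dtype=float)
    hits = sum(ray_hits_triangle(origin, d, vertices[f]) for f in faces)
    return hits % 2 == 1

# Toy watertight mesh: a unit tetrahedron (a stand-in for a hand mesh).
vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])
```

If the mesh had a hole, a ray could leave the volume without being counted, and the parity result would be wrong. That is the reason a watertight parameterization makes this kind of test reliable.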
Abstract
3D hand pose estimation from images has attracted considerable interest in the literature, with new methods steadily improving overall 3D accuracy. One current challenge is hand-to-hand interaction, where self-occlusions and finger articulation make estimation difficult. Little work has applied physical constraints that minimize the hand intersections that arise from noisy estimation. This work addresses the intersection of hands by exploiting an occupancy network that represents the hand's volume as a continuous manifold, which allows us to model the probability of a point being inside a hand. We design an intersection loss function that minimizes the likelihood of hand-to-point intersections. Moreover, we propose a new hand mesh parameterization that improves on the commonly used MANO model in several respects, including lower mesh complexity, extraction of the underlying 3D skeleton, and watertightness. On the benchmark InterHand2.6M dataset, models trained with our intersection loss outperform the state of the art by significantly decreasing the number of hand intersections while lowering the mean per-joint positional error. Additionally, we demonstrate superior performance for 3D hand uplift on the Re:InterHand and SMILE datasets and show reduced hand-to-hand intersections in complex domains such as sign-language pose estimation.
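The idea of penalizing points that an occupancy model considers to be inside the other hand can be sketched as follows. This is a minimal illustration under stated assumptions, not the loss defined in the paper: `toy_occupancy` replaces the trained occupancy network with a soft sphere, and `intersection_loss` uses a negative log-likelihood of points being outside, which is one plausible form. All names and the `sharpness` parameter are hypothetical.

```python
import numpy as np

def toy_occupancy(points, center, radius, sharpness=50.0):
    """Stand-in for a trained occupancy network (assumption for this sketch).

    Returns, for each 3D point, a probability in (0, 1) of lying inside a
    sphere: close to 1 inside, close to 0 outside.
    """
    signed_dist = np.linalg.norm(points - center, axis=-1) - radius
    return 1.0 / (1.0 + np.exp(sharpness * signed_dist))

def intersection_loss(occ_a_at_b, occ_b_at_a, eps=1e-6):
    """Mean negative log-likelihood that points of one hand lie outside the other.

    occ_a_at_b: occupancy of hand A evaluated at points sampled from hand B.
    occ_b_at_a: occupancy of hand B evaluated at points sampled from hand A.
    """
    p = np.clip(np.concatenate([occ_a_at_b, occ_b_at_a]), 0.0, 1.0 - eps)
    return float(np.mean(-np.log(1.0 - p)))

def sphere_surface(center, radius, n, rng):
    """Sample n points on a sphere surface (toy replacement for hand vertices)."""
    d = rng.normal(size=(n, 3))
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    return center + radius * d

def two_hand_loss(center_a, center_b, radius=1.0, n=256, seed=0):
    """Intersection loss between two toy 'hands' at the given centers."""
    rng = np.random.default_rng(seed)
    pts_a = sphere_surface(center_a, radius, n, rng)
    pts_b = sphere_surface(center_b, radius, n, rng)
    return intersection_loss(toy_occupancy(pts_b, center_a, radius),
                             toy_occupancy(pts_a, center_b, radius))

origin = np.zeros(3)
loss_apart = two_hand_loss(origin, np.array([3.0, 0.0, 0.0]))    # no overlap
loss_overlap = two_hand_loss(origin, np.array([1.0, 0.0, 0.0]))  # volumes overlap
```

The loss stays near zero when the two volumes are apart and grows when they overlap. Written in an automatic-differentiation framework, the same expression would provide gradients that push the estimated hands apart.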
