Two Hands Are Better Than One: Resolving Hand to Hand Intersections via Occupancy Networks

Maksym Ivashechkin, Oscar Mendez, Richard Bowden

TL;DR

This work tackles two-hand 3D pose estimation under mutual occlusions by introducing a continuous occupancy network that represents hand volume and a differentiable intersection loss to discourage hand–hand intersections. It couples this with a novel, watertight hand mesh parameterization that reduces mesh complexity and enables reliable inside/outside tests via ray casting. The approach achieves state-of-the-art reduction in hand intersections on InterHand2.6M while maintaining or improving MPJPE, and demonstrates robust intersection suppression in the Re:InterHand and SMILE datasets as well as in-the-wild scenarios. The framework is modular and can enhance existing single-hand estimators, offering practical benefits for fine-grained hand interactions in sign language and other two-hand tasks.
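The inside/outside test via ray casting mentioned above can be sketched as a parity check: cast a ray from the query point and count how many triangles of a watertight mesh it crosses; an odd count means the point is inside. This is only a minimal illustration, not the authors' implementation. The helper names (`ray_hits_triangle`, `is_inside`), the fixed ray direction, and the tetrahedron used as a stand-in for a hand mesh are all assumptions made for this example.

```python
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_hits_triangle(origin, direction, tri, eps=1e-9):
    """Moller-Trumbore ray/triangle test; True if the ray crosses the triangle."""
    v0, v1, v2 = tri
    e1, e2 = _sub(v1, v0), _sub(v2, v0)
    p = _cross(direction, e2)
    det = _dot(e1, p)
    if abs(det) < eps:          # ray is parallel to the triangle plane
        return False
    inv = 1.0 / det
    t_vec = _sub(origin, v0)
    u = _dot(t_vec, p) * inv    # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return False
    q = _cross(t_vec, e1)
    v = _dot(direction, q) * inv  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return False
    t = _dot(e2, q) * inv       # distance along the ray
    return t > eps              # only count hits in front of the origin

def is_inside(point, vertices, faces, direction=(0.3, 0.5, 0.81)):
    """Parity test: an odd number of crossings means the point is inside.

    Assumes the mesh is watertight and the ray does not graze an edge or vertex.
    """
    hits = sum(
        ray_hits_triangle(point, direction, tuple(vertices[i] for i in face))
        for face in faces
    )
    return hits % 2 == 1

# Hypothetical stand-in for a watertight hand mesh: a unit tetrahedron.
verts = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
print(is_inside((0.1, 0.1, 0.1), verts, faces))
print(is_inside((2.0, 2.0, 2.0), verts, faces))
```

The parity rule only holds for closed surfaces, which is one reason watertightness matters for this kind of test.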

Abstract

3D hand pose estimation from images has seen considerable interest from the literature, with new methods improving overall 3D accuracy. One current challenge is to address hand-to-hand interaction where self-occlusions and finger articulation pose a significant problem to estimation. Little work has applied physical constraints that minimize the hand intersections that occur as a result of noisy estimation. This work addresses the intersection of hands by exploiting an occupancy network that represents the hand's volume as a continuous manifold. This allows us to model the probability distribution of points being inside a hand. We designed an intersection loss function to minimize the likelihood of hand-to-point intersections. Moreover, we propose a new hand mesh parameterization that is superior to the commonly used MANO model in many respects including lower mesh complexity, underlying 3D skeleton extraction, watertightness, etc. On the benchmark InterHand2.6M dataset, the models trained using our intersection loss achieve better results than the state-of-the-art by significantly decreasing the number of hand intersections while lowering the mean per-joint positional error. Additionally, we demonstrate superior performance for 3D hand uplift on Re:InterHand and SMILE datasets and show reduced hand-to-hand intersections for complex domains such as sign-language pose estimation.
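To make the idea of an intersection loss concrete, here is a small sketch under stated assumptions. The paper's exact loss is not reproduced in this excerpt, so the form below (the mean negative log-likelihood that each point lies outside the other hand) is an assumed stand-in. The `toy_occupancy` function replaces the learned occupancy network with a soft sphere purely for illustration; both names are hypothetical.

```python
import math

def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def toy_occupancy(point, center=(0.0, 0.0, 0.0), radius=1.0, sharpness=10.0):
    """Assumed stand-in for an occupancy network: probability that `point`
    lies inside a soft sphere (a real model would be conditioned on a hand)."""
    return _sigmoid(sharpness * (radius - math.dist(point, center)))

def intersection_loss(points, occupancy, eps=1e-7):
    """Mean of -log(1 - p_inside) over the points of the other hand.

    Close to zero when all points are outside; grows as points move inside.
    """
    total = 0.0
    for p in points:
        prob = min(max(occupancy(p), 0.0), 1.0 - eps)  # clamp to keep log finite
        total += -math.log(1.0 - prob)
    return total / len(points)

# Points of the "other hand": one set far from the volume, one set inside it.
far_points = [(3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
inside_points = [(0.2, 0.0, 0.0), (0.0, 0.5, 0.0)]
print(intersection_loss(far_points, toy_occupancy))
print(intersection_loss(inside_points, toy_occupancy))
```

Because the loss is a smooth function of the point coordinates, a gradient-based optimizer could push penetrating points back out of the occupied volume, which is the role the abstract assigns to the intersection loss.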

Paper Structure

This paper contains 17 sections, 1 equation, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: The pipeline for accurate 3D estimation of interacting hands. A CNN model (e.g., ResNet, MediaPipe) processes the input image and extracts the image features or 2D keypoints needed to uplift the hands into 3D. Note that our approach is independent of the backbone used. A pre-trained occupancy network with frozen weights is then conditioned on the right hand, and intersections are tested against the left hand, optionally also vice versa with the hands flipped, as illustrated. Red and green edges highlight the right and left hands, respectively. Light green and light orange points visualize the hand density determined by the occupancy network, and their size indicates the likelihood of intersection. Because both hands are fully differentiable with respect to the occupancy and CNN networks, the intersection loss can be backpropagated efficiently. The source image is taken from the InterHand2.6M dataset.
  • Figure 2: Comparison of plain (left) and complex (right) watertight hand meshes generated with our parameterized mesh model. The green pose (underneath the orange envelope) is found using forward kinematics (FK), which combines angles and bone lengths. The yellow points are also obtained via FK, using pre-determined offsets from the underlying skeleton. The red triangles span the entire hand surface.
  • Figure 3: Comparison of applying different point sets to check intersections: a) sparse skeleton, b) skeleton with additional points along the edges, c) skeleton with mesh surface points. The yellow points show the density of the right hand, and the black points highlight intersections.
  • Figure 4: Comparison of failed MANO meshes (red) and our meshes (green) fitted to the InterHand2.6M 3D hand joints. The black circles on the MANO meshes highlight specific problems with the MANO hand's appearance, such as twisted fingers, unrealistic shape, and incorrect finger orientation.
  • Figure 5: Correlation trend of per-point intersection probability found via occupancy network and ray-casting algorithm.
  • ...and 3 more figures
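
The denser point set described in the Figure 3 caption (a skeleton with additional points along its edges) can be illustrated with simple linear interpolation between connected joints. This is a minimal sketch only: the function name `densify_skeleton`, the number of points per bone, and the three-joint toy skeleton are assumptions for this example and are not taken from the paper.

```python
def densify_skeleton(joints, bones, points_per_bone=3):
    """Return the joints plus evenly spaced points along each bone.

    joints: list of (x, y, z) tuples.
    bones:  list of (parent_index, child_index) pairs.
    """
    points = [tuple(j) for j in joints]
    for a, b in bones:
        for k in range(1, points_per_bone + 1):
            t = k / (points_per_bone + 1)  # interior fraction along the bone
            points.append(tuple((1 - t) * pa + t * pb
                                for pa, pb in zip(joints[a], joints[b])))
    return points

# Toy skeleton: three joints connected by two bones.
joints = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (4.0, 4.0, 0.0)]
bones = [(0, 1), (1, 2)]
dense = densify_skeleton(joints, bones, points_per_bone=3)
print(len(dense))
```

Each of the resulting points could then be queried against an occupancy model of the other hand; a sparser set is cheaper but can miss penetrations that fall between joints.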