GEARS: Local Geometry-aware Hand-object Interaction Synthesis

Keyang Zhou, Bharat Lal Bhatnagar, Jan Eric Lenssen, Gerard Pons-Moll

TL;DR

GEARS tackles the generalization gap in hand-object interaction synthesis by introducing a joint-centered local geometry sensor that captures fine-grained object geometry around each hand joint, together with a spatio-temporal transformer that models inter-joint dependencies. The method coarsely initializes the joints, refines them with joint-local features, and fits the MANO hand model to produce a complete hand mesh sequence; training is augmented by a data-synthesis pipeline that turns static grasps into dynamic sequences. Quantitatively, GEARS outperforms baselines on GRAB and InterCap across multiple metrics, while ablations validate the critical role of the Joint Displacement Network and of the synthetic data in improving generalization. The approach enables more realistic, varied, and scalable hand-object interactions, with implications for digital humans, AR/VR, and robotics; the authors state that code and pretrained models will be released.

Abstract

Generating realistic hand motion sequences in interaction with objects has gained increasing attention with the growing interest in digital humans. Prior work has illustrated the effectiveness of employing occupancy-based or distance-based virtual sensors to extract hand-object interaction features. Nonetheless, these methods show limited generalizability across object categories, shapes and sizes. We hypothesize that this is due to two reasons: 1) the limited expressiveness of employed virtual sensors, and 2) scarcity of available training data. To tackle this challenge, we introduce a novel joint-centered sensor designed to reason about local object geometry near potential interaction regions. The sensor queries for object surface points in the neighbourhood of each hand joint. As an important step towards mitigating the learning complexity, we transform the points from global frame to hand template frame and use a shared module to process sensor features of each individual joint. This is followed by a spatio-temporal transformer network aimed at capturing correlation among the joints in different dimensions. Moreover, we devise simple heuristic rules to augment the limited training sequences with vast static hand grasping samples. This leads to a broader spectrum of grasping types observed during training, in turn enhancing our model's generalization capability. We evaluate on two public datasets, GRAB and InterCap, where our method shows superiority over baselines both quantitatively and perceptually.
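
The joint-centered sensor described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `joint_local_sensor`, the `radius` and `max_points` values, and the assumption that a single rotation `hand_rotation` maps hand-template axes to global axes are all hypothetical choices made for illustration.

```python
import numpy as np

def joint_local_sensor(joints, object_points, hand_rotation,
                       radius=0.05, max_points=32):
    """Gather object surface points near each joint (hypothetical sketch).

    joints:        (J, 3) joint positions in the global frame
    object_points: (N, 3) points sampled on the object surface, global frame
    hand_rotation: (3, 3) rotation taking hand-template axes to global axes
                   (assumed to be known, e.g. from the wrist orientation)
    Returns (J, max_points, 3) joint-relative offsets expressed in the
    template frame, plus a (J, max_points) boolean validity mask.
    """
    num_joints = joints.shape[0]
    features = np.zeros((num_joints, max_points, 3))
    mask = np.zeros((num_joints, max_points), dtype=bool)
    for j, joint in enumerate(joints):
        # Offsets from this joint to every object point, still in global axes.
        offsets = object_points - joint
        dists = np.linalg.norm(offsets, axis=1)
        # Keep only points inside the sensing radius, nearest first.
        idx = np.flatnonzero(dists <= radius)
        idx = idx[np.argsort(dists[idx])][:max_points]
        # Global -> template frame: R^T v, written as v @ R for row vectors.
        local = offsets[idx] @ hand_rotation
        features[j, :len(idx)] = local
        mask[j, :len(idx)] = True
    return features, mask
```

Expressing the offsets in a hand-aligned frame means that the same local geometry produces the same sensor reading regardless of where the hand sits in the scene, which is one plausible reading of why the abstract says this step reduces learning complexity. The padded output and mask are a convenience for feeding a shared per-joint module.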

Paper Structure

This paper contains 16 sections, 13 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: An overview of our method. The input consists of the hand trajectory, object trajectory and object template mesh. For each time frame, the object mesh is cropped with a cube-shaped virtual sensor positioned and oriented based on the wrist. The cropped object points together with the hand trajectory are fed to the Joint Initialization Network to predict coarse joint locations. We then place more fine-grained geometry sensors at each joint to extract joint-local object features. The features are subsequently processed by the Joint Displacement Network to refine the initialized joints. Finally, we fit the MANO hand model [MANO:SIGGRAPHASIA:2017] to the joints to get the hand mesh sequence.
  • Figure 2: Visualization of our joint-local geometry sensor. (Left) Given the joint positions and the object mesh, we sample points on the object surface within a specified radius centered at each joint. The object points are represented in a joint-local frame. (Right) We transform the sampled object points from the global frame to the canonical frame defined by the MANO template hand.
  • Figure 3: An illustration of the spatial and temporal attention networks. We first process the features of each joint with PointNet. For spatial attention, every joint attends to every other joint of the same hand. For temporal attention, a joint in one frame attends to the same joint in every other frame.
  • Figure 4: A sample training sequence synthesized by our heuristic rule. At the rightmost side of the time axis is a static grasping pose from ObMan [hasson19obman]. We synthesize intermediate poses by interpolating joint angles from the mean MANO pose.
  • Figure 5: Qualitative results on GRAB (top row) and InterCap (bottom two rows). GEARS makes effective contact with the objects while avoiding hand-object inter-penetration.
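
The data-synthesis heuristic in the Figure 4 caption (interpolating joint angles from the mean MANO pose to a static grasp) can be sketched as follows. This is only an illustration under stated assumptions: the function name `synthesize_grasp_sequence`, the 45-dimensional pose vector, the frame count, and the use of plain linear interpolation are hypothetical, since the caption does not specify the interpolation scheme.

```python
import numpy as np

def synthesize_grasp_sequence(grasp_pose, mean_pose=None, num_frames=30):
    """Blend joint angles from a rest pose to a static grasp (hypothetical sketch).

    grasp_pose: (P,) joint-angle vector of the static grasp
    mean_pose:  (P,) joint-angle vector of the starting pose; defaults to zeros,
                assuming angles are expressed relative to the mean hand pose
    Returns a (num_frames, P) array whose first row is mean_pose and whose
    last row is grasp_pose.
    """
    grasp_pose = np.asarray(grasp_pose, dtype=float)
    if mean_pose is None:
        mean_pose = np.zeros_like(grasp_pose)
    mean_pose = np.asarray(mean_pose, dtype=float)
    # Blend weights run from 0 (rest pose) to 1 (final grasp).
    weights = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - weights) * mean_pose + weights * grasp_pose

# Usage: a 5-frame sequence ending in an assumed 45-parameter grasp pose.
sequence = synthesize_grasp_sequence(np.ones(45), num_frames=5)
```

Linear blending of angle parameters is the simplest option and is adequate for small rotations; for large finger rotations, a per-joint spherical interpolation would follow the rotation geometry more faithfully.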