Hand-Centric Motion Refinement for 3D Hand-Object Interaction via Hierarchical Spatial-Temporal Modeling

Yuze Hao, Jianrong Zhang, Tao Zhuo, Fuan Wen, Hehe Fan

TL;DR

A hand-centric representation describes the dynamic spatial-temporal relation between hands and objects, and a new architecture models spatial and temporal structure hierarchically to capture the dynamic cues of hand-object interaction.

Abstract

Hands are the main medium through which people interact with the world. Generating plausible 3D motion for hand-object interaction is vital for applications such as virtual reality and robotics. Although grasp tracking and object manipulation synthesis can produce coarse hand motion, this motion is typically noisy and full of jitter. To address this problem, we propose a data-driven method for coarse motion refinement. First, we design a hand-centric representation to describe the dynamic spatial-temporal relation between hands and objects. Compared to the object-centric representation, our hand-centric representation is straightforward and does not require an ambiguous projection process that converts object-based predictions into hand motion. Second, to capture the dynamic cues of hand-object interaction, we propose a new architecture that models the spatial and temporal structure in a hierarchical manner. Extensive experiments demonstrate that our method outperforms previous methods by a noticeable margin.
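To make the idea of a hand-centric representation concrete, the following is a minimal sketch, not the paper's actual feature definition, which this summary does not specify. It assumes that each hand point is described by the offset vector to its nearest object point plus the corresponding distance, computed per frame. The function name `hand_centric_features` and the feature layout are hypothetical choices for illustration.

```python
import numpy as np


def hand_centric_features(hand_pts, obj_pts):
    """Hypothetical hand-centric descriptor (an assumption, not the paper's definition).

    hand_pts: (T, H, 3) hand points over T frames.
    obj_pts:  (T, O, 3) object points over the same T frames.
    Returns:  (T, H, 4) array holding, for each hand point, the offset to its
              nearest object point (3 values) followed by the distance (1 value).
    """
    hand_pts = np.asarray(hand_pts, dtype=float)
    obj_pts = np.asarray(obj_pts, dtype=float)

    # Pairwise offsets from every hand point to every object point: (T, H, O, 3).
    diff = obj_pts[:, None, :, :] - hand_pts[:, :, None, :]
    dist = np.linalg.norm(diff, axis=-1)   # (T, H, O)
    nearest = dist.argmin(axis=-1)         # (T, H) index of the closest object point

    # Gather the offset and distance of the nearest object point per hand point.
    t_idx = np.arange(hand_pts.shape[0])[:, None]
    h_idx = np.arange(hand_pts.shape[1])[None, :]
    offset = diff[t_idx, h_idx, nearest]                # (T, H, 3)
    min_dist = dist[t_idx, h_idx, nearest][..., None]   # (T, H, 1)

    return np.concatenate([offset, min_dist], axis=-1)
```

Because the features are indexed by hand points, a network that predicts them is already predicting quantities attached to the hand, which is the property the abstract contrasts with object-centric encodings that need a projection step back to hand motion. The brute-force pairwise computation here is only suitable for small point sets.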

Paper Structure

This paper contains 27 sections, 7 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Compared with TOCH [zhou2022toch], our method has three advantages. (1) To capture the relation between hand and object, the existing method first emits rays from the object and then collects the points that arrive at the hand. In contrast, we propose a straightforward hand-centric representation, which directly models the hand-object interaction. (2) Our hierarchical spatial-temporal architecture captures dynamic information across different scales better than the fixed-scale design in TOCH. (3) Due to the direct hand-object representation, our method does not require the additional post-processing that converts an object-centric representation into hand motion.
  • Figure 2: Overview of our framework. Given a perturbed interaction sequence, we first convert the sequence into our hand-centric correspondence representation. The representations are then fed into a hierarchical spatial encoder to capture local and global spatial features for each frame. Next, the features are passed through a hierarchical temporal encoder to extract long-term and short-term dependencies across frames. Lastly, the refined sequences are obtained from a reconstruction decoder followed by a post-optimization step.
  • Figure 3: Qualitative results for refining inconsistent poses (highlighted with blue arrows) in a perturbed tracking sequence. The sequence refined by TOCH [zhou2022toch] exhibits improper grasping poses (highlighted with red boxes). In contrast, our reconstructions show more plausible interaction poses.
  • Figure 4: Qualitative results for refining inter-penetration (highlighted with blue arrows) in a perturbed tracking sequence. The sequence refined by TOCH [zhou2022toch] exhibits inadequate contact (highlighted with red boxes), while our results achieve more realistic interaction.
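
The Figure 2 caption mentions extracting short-term and long-term dependencies across frames. As a rough illustration of what multi-scale temporal context means, the sketch below swaps in a much simpler technique than a learned encoder: centered moving averages of per-frame features at several window sizes, concatenated with the original features. The function names, the window sizes, and the use of averaging are all assumptions for illustration, not the architecture described in the paper.

```python
import numpy as np


def moving_average(x, window):
    """Centered moving average over time for features x of shape (T, D).

    Uses edge padding so the output keeps T frames. `window` must be odd.
    """
    assert window % 2 == 1, "window must be odd for a centered average"
    pad = window // 2
    padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    # Smooth each feature channel independently along the time axis.
    return np.stack(
        [np.convolve(padded[:, d], kernel, mode="valid") for d in range(x.shape[1])],
        axis=1,
    )


def multi_scale_temporal_features(x, windows=(3, 9)):
    """Concatenate raw per-frame features with smoothed versions at each scale.

    Small windows keep short-term detail; large windows summarize long-term
    trends. Returns an array of shape (T, D * (1 + len(windows))).
    """
    x = np.asarray(x, dtype=float)
    return np.concatenate([x] + [moving_average(x, w) for w in windows], axis=1)
```

Frame-to-frame jitter shows up as a large gap between the raw features and the short-window average, while the long-window average tracks the overall motion trend. A learned hierarchical encoder can be thought of as replacing these fixed averages with trainable aggregation at each scale.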