Exploiting Spatial-Temporal Context for Interacting Hand Reconstruction on Monocular RGB Video

Weichao Zhao, Hezhen Hu, Wengang Zhou, Li Li, Houqiang Li

TL;DR

This work leverages temporal context to complement the insufficient information provided by a single frame, designs a novel temporal framework with a temporal constraint for interacting hand motion smoothness, and proposes an interpenetration detection module to produce kinetically plausible interacting hands without physical collisions.

Abstract

Reconstructing interacting hands from monocular RGB data is a challenging task, as it involves many interfering factors, e.g., self- and mutual occlusion and similar textures. Previous works only leverage information from a single RGB image without modeling the physically plausible relation between the two hands, which leads to inferior reconstruction results. In this work, we explicitly exploit spatial-temporal information to achieve better interacting hand reconstruction. On the one hand, we leverage temporal context to complement the insufficient information provided by a single frame, and design a novel temporal framework with a temporal constraint for interacting hand motion smoothness. On the other hand, we propose an interpenetration detection module to produce kinetically plausible interacting hands without physical collisions. Extensive experiments validate the effectiveness of our proposed framework, which achieves new state-of-the-art performance on public benchmarks.

Paper Structure (15 sections, 10 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: An illustration of our proposed framework. Given a cropped RGB sequence, the frame encoder first distills image features for each frame. Then the multi-scale feature extractor extracts multi-scale feature sequences for temporal modeling. The temporal encoder refines the features of both hands with temporal context. Finally, the MANO decoder regresses the 3D coordinates of the surface vertices of both hands, and the interpenetration detector detects whether the two hands collide during training. The weights of the frame encoder and multi-scale feature extractor are shared among RGB frames.
  • Figure 2: Illustration of the multi-scale feature extractor in our proposed framework.
  • Figure 3: Quantitative comparison of our proposed method with previous methods on the InterHand2.6M [moon2020interhand2] and HIC [hasson2019learning] datasets. The horizontal axis indicates the error threshold for interacting hands, while the vertical axis indicates the 3D Percentage of Correct Keypoints (3D PCK).
  • Figure 4: Qualitative comparison between the state-of-the-art method, ITH-3D [zhang2021interacting], and our proposed method. Our method produces more accurate and smoother interacting hand poses, while ITH-3D produces more collisions between the two hands and worse hand poses.
  • Figure 5: Qualitative ablation study on InterHand2.6M. 'w/o Interpenetration Constraint' means removing the spatial information from the full model. The dashed red circles in the alternative view mark the interpenetration region between the two hands. The results show the effectiveness of the spatial information.
  • ...and 7 more figures
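
The 3D PCK curve described in the Figure 3 caption can be illustrated with a short sketch. This is not the authors' evaluation code. It assumes the common definition of 3D PCK: the fraction of keypoints whose Euclidean error is at most a given threshold. The function names `pck_3d` and `pck_curve`, the inclusive comparison, and the toy coordinates (read here as millimetres) are all assumptions made for illustration.

```python
import math


def pck_3d(pred, gt, threshold):
    """Fraction of keypoints whose 3D Euclidean error is <= threshold.

    pred, gt: equal-length sequences of (x, y, z) tuples (hypothetical layout).
    The inclusive comparison is an assumption; some protocols use '<'.
    """
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and the same length")
    correct = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return correct / len(gt)


def pck_curve(pred, gt, thresholds):
    """One PCK value per threshold: the points of a curve like Figure 3."""
    return [pck_3d(pred, gt, t) for t in thresholds]


# Toy data (not from the paper): four keypoints with errors 5, 12, 0 and 30.
gt_joints = [(0, 0, 0), (10, 0, 0), (0, 10, 0), (0, 0, 10)]
pred_joints = [(3, 4, 0), (10, 0, 12), (0, 10, 0), (0, 30, 10)]

curve = pck_curve(pred_joints, gt_joints, [5, 15, 50])
print(curve)  # [0.5, 0.75, 1.0]
```

Sweeping the threshold along the horizontal axis and plotting the returned values gives a monotonically non-decreasing curve. A higher curve means more keypoints fall within each error budget.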