Kinematics-based 3D Human-Object Interaction Reconstruction from Single View

Yuhang Chen, Chenxing Wang

TL;DR

This work tackles 3D human-object interaction reconstruction from a single-view video, where depth ambiguity and occlusion impede purely data-driven approaches. It introduces a neural-hybrid inverse kinematics framework that treats the human body as a kinematic chain and drives joints toward predicted contact regions on objects. A CRRNet is proposed to robustly detect contact regions from monocular video, supplying targets for the IK solver. On BEHAVE, the method achieves state-of-the-art accuracy, demonstrates online optimization without offline training, and offers portability to integrate with other HOI systems.

Abstract

Reconstructing 3D human-object interaction (HOI) from single-view RGB images is challenging due to the absence of depth information and potential occlusions. Existing methods predict body poses by relying solely on networks trained on a few indoor datasets, which cannot guarantee plausible results when body parts are invisible due to occlusion, a frequent occurrence. Inspired by the end-effector localization task in robotics, we propose a kinematics-based method that accurately drives the joints of the human body to the human-object contact regions. We first propose an improved forward kinematics algorithm and then introduce a Multi-Layer Perceptron into the inverse kinematics solution to determine the joint poses, which achieves more precise results than the numerical methods commonly used in robotics. In addition, a Contact Region Recognition Network (CRRNet) is proposed to robustly determine the contact regions from a single-view video. Experimental results demonstrate that our method outperforms the state of the art on the BEHAVE benchmark. Additionally, our approach shows good portability and can be seamlessly integrated into other methods for optimization.
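
For readers unfamiliar with the robotics background, the sketch below illustrates forward kinematics on a simple planar chain and one classical numerical inverse kinematics solver, cyclic coordinate descent (CCD). It is not the paper's improved forward kinematics or its MLP-based solver. It only shows the kind of numerical baseline the abstract contrasts against, under the assumption of a 2D chain with unconstrained revolute joints. The function names `forward_kinematics` and `ccd_ik`, the link lengths, and the target point are all illustrative choices.

```python
import math

def forward_kinematics(angles, lengths):
    """Return the 2D position of every joint in a planar chain, root first.

    angles[i] is the rotation of link i relative to its parent link.
    """
    x = y = total = 0.0
    points = [(x, y)]
    for theta, length in zip(angles, lengths):
        total += theta  # accumulate relative angles into a world-frame angle
        x += length * math.cos(total)
        y += length * math.sin(total)
        points.append((x, y))
    return points

def ccd_ik(angles, lengths, target, iters=200, tol=1e-6):
    """Cyclic coordinate descent: rotate one joint at a time so the end
    effector swings toward the target (a stand-in for a contact point)."""
    angles = list(angles)
    for _ in range(iters):
        for i in reversed(range(len(angles))):
            points = forward_kinematics(angles, lengths)
            jx, jy = points[i]    # pivot: joint i
            ex, ey = points[-1]   # current end-effector position
            to_end = math.atan2(ey - jy, ex - jx)
            to_target = math.atan2(target[1] - jy, target[0] - jx)
            angles[i] += to_target - to_end
        ex, ey = forward_kinematics(angles, lengths)[-1]
        if math.hypot(ex - target[0], ey - target[1]) < tol:
            break
    return angles

if __name__ == "__main__":
    lengths = [1.0, 1.0, 1.0]   # hypothetical link lengths
    target = (1.5, 1.0)         # hypothetical contact point within reach
    solved = ccd_ik([0.3, 0.3, 0.3], lengths, target)
    print(forward_kinematics(solved, lengths)[-1])
```

Iterative solvers of this kind adjust joints one step at a time and can converge slowly, which is the kind of limitation that motivates replacing the numerical step with a learned one.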

Paper Structure (16 sections, 18 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Given a challenging occluded-view image, our model produces a more accurate 3D human body pose and position than the current SOTA method VisTracker, outperforming previous work on the standard benchmark.
  • Figure 2: The framework of our proposed method, consisting of: (1) Mesh reconstruction: given a video sequence, the meshes of objects and human bodies are estimated with rough poses using an existing method; (2) Contact region recognition: the features of the input image sequence and the initially estimated point cloud are integrated to estimate the contact regions on the object surface; (3) Human pose optimization: a neural-based kinematics model is proposed to actively guide the human body to reach a contact region.
  • Figure 3: The definition of kinematic chains and joint types. In (a), we define five kinematic chains: left / right arm, left / right leg, body. In (b), we define four joint types: target joint $j$, rotation, translation, fixed. We activate the corresponding kinematic chain based on the type of contact region and assign the corresponding types to the joints.
  • Figure 4: The architecture of our proposed IK solution.
  • Figure 5: The comparison between our Neural Solver and naive MLP.
  • ...and 6 more figures