Ins-HOI: Instance Aware Human-Object Interactions Recovery

Jiajun Zhang, Yuxiang Zhang, Hongwen Zhang, Xiao Zhou, Boyao Zhou, Ruizhi Shao, Zonghai Hu, Yebin Liu

TL;DR

This work proposes an end-to-end Instance-aware Human-Object Interactions recovery (Ins-HOI) framework by introducing an instance-level occupancy field representation, and proposes a complementary training strategy that leverages synthetic data to introduce instance-level shape priors, enabling the disentanglement of occupancy fields for different instances.

Abstract

Accurately modeling detailed interactions between human/hand and object is an appealing yet challenging task. Current multi-view capture systems are only capable of reconstructing multiple subjects into a single, unified mesh, which fails to model the states of each instance individually during interactions. To address this, previous methods use template-based representations to track human/hand and object. However, the quality of the reconstructions is limited by the descriptive capabilities of the templates, so these methods inherently struggle with geometric details, pressing deformations, and invisible contact surfaces. In this work, we propose an end-to-end Instance-aware Human-Object Interactions recovery (Ins-HOI) framework by introducing an instance-level occupancy field representation. However, the real-captured data is presented as a holistic mesh and cannot provide instance-level supervision. To address this, we further propose a complementary training strategy that leverages synthetic data to introduce instance-level shape priors, enabling the disentanglement of occupancy fields for different instances. Specifically, synthetic data, created by randomly combining individual scans of humans/hands and objects, guides the network to learn a coarse prior of instances. Meanwhile, real-captured data helps in learning the overall geometry and restricting interpenetration in contact areas. As demonstrated in experiments, our method Ins-HOI supports instance-level reconstruction and provides reasonable and realistic invisible contact surfaces even in cases of extremely close interaction. To facilitate research on this task, we collect a large-scale, high-fidelity 3D scan dataset, including 5.2k high-quality scans with real-world human-chair and hand-object interactions. The code and data will be made public for research purposes.
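
The complementary training strategy described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the max/min composition of the two occupancy fields, the binary cross-entropy loss, and the `w_pen` weight are all assumptions chosen for illustration. The idea shown is that a synthetic sample supervises each instance field directly, while a real-captured sample (a holistic mesh) can only supervise the union of the fields, with an extra penalty on their intersection to discourage interpenetration.

```python
import numpy as np


def union_occupancy(occ_a, occ_b):
    """Soft union of two occupancy fields (assumed: element-wise max)."""
    return np.maximum(occ_a, occ_b)


def intersection_occupancy(occ_a, occ_b):
    """Soft intersection of two occupancy fields (assumed: element-wise min)."""
    return np.minimum(occ_a, occ_b)


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted and target occupancy."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(pred)
                           + (1.0 - target) * np.log(1.0 - pred))))


def complementary_loss(occ_human, occ_object, *, gt_human=None,
                       gt_object=None, gt_union=None, w_pen=1.0):
    """Hypothetical loss that switches on the kind of supervision available.

    Synthetic sample: per-instance ground truth exists, so each field is
    supervised on its own.
    Real sample: only the holistic (union) occupancy is known, so the union
    is supervised and the overlap of the two fields is penalized.
    """
    if gt_human is not None and gt_object is not None:
        return bce(occ_human, gt_human) + bce(occ_object, gt_object)
    if gt_union is None:
        raise ValueError("real-captured samples need gt_union")
    union = union_occupancy(occ_human, occ_object)
    overlap = intersection_occupancy(occ_human, occ_object)
    return bce(union, gt_union) + w_pen * float(np.mean(overlap))
```

With these definitions, two predictions that produce the same union receive different losses on a real sample if one of them lets the human and object fields overlap, which is the role the abstract assigns to real-captured data in contact areas.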

Paper Structure (30 sections, 9 equations, 15 figures, 3 tables)

Figures (15)

  • Figure 1: Our instance-level implicit-based approach achieves accurate reconstruction of the geometry and invisible contact areas. In contrast, tracking and optimization-based methods [bhatnagar22behave], as well as marker-based methods [jiang2023chairs], rely on human parametric templates, produce inaccuracies in the interaction region, and lack fine-detailed geometry.
  • Figure 2: Examples of 3D scans from the Ins-Sit and Ins-Grasp datasets. The scans contain high-fidelity geometries and textures. Ins-Sit captures a wide range of sitting postures and diverse clothing styles, whereas Ins-Grasp includes a broader range of objects.
  • Figure 3: Pipeline of virtual multi-view fusion method for automatic 3D scan semantic segmentation.
  • Figure 4: Examples and pipeline of our synthetic data generation process. The synthetic data of human-chair interactions comprises two distinct types: Syn_s and Syn_r.
  • Figure 5: Overview of the method: (a) showcases the synthetic data augmentation process using the THuman and Ins-Sit datasets to form a training dataset. (b) highlights how the training components provide unique guidance for complementary learning (blue and pink denote the human and chair meshes; purple and red indicate the union and intersection). (c) depicts our benchmark Ins-HOI, which takes sparse-view inputs and produces instance-level human-object recovery in an end-to-end manner.
  • ...and 10 more figures