Joint Reconstruction of 3D Human and Object via Contact-Based Refinement Transformer

Hyeongjin Nam, Daniel Sungho Jung, Gyeongsik Moon, Kyoung Mu Lee

TL;DR

CONTHO is a framework for jointly reconstructing a 3D human and object from a single image, with human-object contact as the central cue. It combines 3D-guided contact estimation (ContactFormer) with a contact-based refinement Transformer (CRFormer) that refines both the human and object meshes, using the initial 3D reconstructions as guidance and a masking strategy to focus on interaction regions. The approach achieves state-of-the-art results on BEHAVE and InterCap for both contact estimation and joint 3D reconstruction, while significantly reducing inference time compared with optimization-based methods. Accurate, efficient reconstruction of human-object interaction (HOI) from a single image is relevant to interactive AR/VR and robotics.

Abstract

Human-object contact serves as a strong cue for understanding how humans physically interact with objects. Nevertheless, the use of human-object contact information for the joint reconstruction of 3D human and object from a single image has not been widely explored. In this work, we present a novel joint 3D human-object reconstruction method (CONTHO) that effectively exploits contact information between humans and objects. There are two core designs in our system: 1) 3D-guided contact estimation and 2) contact-based 3D human and object refinement. First, for accurate human-object contact estimation, CONTHO initially reconstructs 3D humans and objects and utilizes them as explicit 3D guidance for contact estimation. Second, to refine the initial reconstructions of 3D human and object, we propose a novel contact-based refinement Transformer that effectively aggregates human features and object features based on the estimated human-object contact. The proposed contact-based refinement prevents the learning of erroneous correlations between human and object, which enables accurate 3D reconstruction. As a result, CONTHO achieves state-of-the-art performance in both human-object contact estimation and joint reconstruction of 3D human and object. The code is publicly available at https://github.com/dqj5182/CONTHO_RELEASE.
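
The idea of aggregating features "based on the estimated human-object contact" can be illustrated with a small sketch. This is not the authors' implementation: it assumes a hard attention mask with a hypothetical threshold of 0.5, a single attention head without learned projections, and toy 2D features. The function name `contact_masked_attention` and all inputs are made up for illustration. Object vertex features act as queries and attend only to human vertices whose contact probability passes the threshold.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contact_masked_attention(queries, keys, values, key_contact, threshold=0.5):
    """Scaled dot-product attention restricted to keys marked as in contact.

    queries, keys, values: lists of feature vectors (lists of floats).
    key_contact: per-key contact probability in [0, 1].
    threshold: assumed cut-off for treating a key as "in contact".
    A query receives a zero vector when no key is in contact.
    """
    dim = len(queries[0])
    out_dim = len(values[0])
    allowed = [j for j, c in enumerate(key_contact) if c >= threshold]
    outputs = []
    for q in queries:
        if not allowed:
            outputs.append([0.0] * out_dim)
            continue
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in allowed
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, allowed))
            for d in range(out_dim)
        ])
    return outputs


# Toy example: two human vertices in contact, one (say, on the head) not.
human_feats = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
human_contact = [0.9, 0.8, 0.1]
object_feats = [[1.0, 1.0]]

refined = contact_masked_attention(
    object_feats, human_feats, human_feats, human_contact
)
```

In this toy setup the non-contact vertex cannot influence the object feature at all, whatever its value, which is the behavior the abstract attributes to contact-based refinement: correlations with human regions unrelated to the interaction are not learned.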

Paper Structure

This paper contains 23 sections, 4 equations, 14 figures, and 6 tables.

Figures (14)

  • Figure 1: Overview of CONTHO. CONTHO estimates human-object contact maps with the proposed ContactFormer and exploits the contact maps for 3D human and object refinement with CRFormer. The green color indicates human-object contact regions estimated by ContactFormer.
  • Figure 2: Example of undesired human-object correlation. Because the Transformer baseline learns an undesired human-object correlation, its reconstructed monitor always faces the human head, even though the monitor stays fixed in the images. Our proposed CRFormer effectively alleviates this correlation, resulting in accurate reconstructions.
  • Figure 3: Overall pipeline of CONTHO. Our method first reconstructs 3D human and object meshes ($\mathbf{M}_{\text{h}}$ and $\mathbf{M}_{\text{o}}$). Then, the initial meshes are utilized to construct 3D vertex features ($\mathbf{F}_{\text{vh}}$ and $\mathbf{F}_{\text{vo}}$). Subsequently, ContactFormer estimates human-object contact maps ($\mathbf{C}_{\text{h}}$ and $\mathbf{C}_{\text{o}}$) from the 3D vertex features. Lastly, CRFormer aggregates the 3D vertex features based on the estimated contact maps to provide refined human and object meshes ($\mathbf{M}_{\text{h}}^{\ast}$ and $\mathbf{M}_{\text{o}}^{\ast}$). The green color indicates the estimated contacting regions.
  • Figure 4: Analysis of undesired human-object correlation on BEHAVE [bhatnagar2022behave]. We conduct a sensitivity test, which inspects the regions that the reconstruction is sensitive to, for the Transformer baseline and our CRFormer. In the Transformer baseline, the object errors are sensitive to human regions not actually related to human-object interaction, as a result of undesired correlation. In our CRFormer, the object errors are mostly sensitive around regions containing human-object contact.
  • Figure 5: Qualitative comparison of human-object contact estimation with BSTRO [huang2022capturing] and DECO [tripathi2023deco] on BEHAVE [bhatnagar2022behave] (left) and InterCap [huang2022intercap] (right). The green color indicates the contacting regions.
  • ...and 9 more figures
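
The sensitivity test mentioned in the Figure 4 caption can be approximated with an occlusion-style probe. The sketch below is an assumption about how such a test might look, not the authors' protocol: it zeroes one human vertex feature at a time and measures how much a toy model's output changes. The two "models" are hypothetical stand-ins (a plain mean over all human vertices versus a mean restricted to contact vertices), and the features and contact map are made up.

```python
def weighted_mean(feats, weights):
    """Weighted mean of per-vertex features; zero vector if all weights are 0."""
    dim = len(feats[0])
    total = sum(weights)
    if total == 0:
        return [0.0] * dim
    return [
        sum(w * f[d] for w, f in zip(weights, feats)) / total
        for d in range(dim)
    ]


def sensitivity(model, feats):
    """Zero each vertex feature in turn; return the L2 change in model output."""
    base = model(feats)
    changes = []
    for i in range(len(feats)):
        perturbed = [list(f) for f in feats]
        perturbed[i] = [0.0] * len(perturbed[i])
        out = model(perturbed)
        changes.append(sum((a - b) ** 2 for a, b in zip(out, base)) ** 0.5)
    return changes


# Toy human vertex features: two hand vertices in contact, one head vertex not.
human_feats = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
contact = [1.0, 1.0, 0.0]


def baseline(feats):
    """Stand-in for unmasked aggregation: every human vertex contributes."""
    return weighted_mean(feats, [1.0] * len(feats))


def contact_aware(feats):
    """Stand-in for contact-based aggregation: only contact vertices contribute."""
    return weighted_mean(feats, contact)


baseline_sens = sensitivity(baseline, human_feats)
contact_sens = sensitivity(contact_aware, human_feats)
```

Under these assumptions the baseline output reacts to the head vertex even though it is not in contact, while the contact-aware output reacts only to the hand vertices, mirroring the qualitative pattern the caption describes.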