End-to-End HOI Reconstruction Transformer with Graph-based Encoding

Zhenrong Wang, Qi Zheng, Sihan Ma, Maosheng Ye, Yibing Zhan, Dongjiang Li

TL;DR

HOI reconstruction from a single image is challenging because it must balance global mesh structure with local interaction details. The authors propose HOI-TG, an end-to-end transformer augmented with graph-based encoding that implicitly models human–object interactions, combining global attention with local topology. The method achieves state-of-the-art results on BEHAVE and InterCap, substantially improving 3D mesh reconstruction and contact quality over prior methods that rely on explicit contact constraints and over other transformer-based approaches. The work shows that implicit interaction modeling with graph-aware fusion can capture complex HOI relationships and generalize to in-the-wild scenarios, while also highlighting remaining challenges such as lying poses and symmetric objects.

Abstract

With the diversification of human-object interaction (HOI) applications and the success of human mesh capture, HOI reconstruction has gained widespread attention. Existing mainstream HOI reconstruction methods often rely on explicitly modeling interactions between humans and objects. However, this approach creates a natural conflict between 3D mesh reconstruction, which emphasizes global structure, and fine-grained contact reconstruction, which focuses on local details. To address the limitations of explicit modeling, we propose the End-to-End HOI Reconstruction Transformer with Graph-based Encoding (HOI-TG). It implicitly learns the interaction between humans and objects by leveraging self-attention mechanisms. Within the transformer architecture, we devise graph residual blocks to aggregate the topology among vertices of different spatial structures. This dual focus effectively balances global and local representations. Without bells and whistles, HOI-TG achieves state-of-the-art performance on the BEHAVE and InterCap datasets. Particularly on the challenging InterCap dataset, our method improves the reconstruction results for human and object meshes by 8.9% and 8.6%, respectively.
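To make the idea of a graph residual block more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the block's internals, so this assumes a generic graph convolution (row-normalized adjacency with self-loops, one linear map, ReLU) wrapped in a skip connection. The names `normalized_adjacency`, `matmul`, and `graph_residual_block` are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def normalized_adjacency(num_vertices: int, edges: List[Tuple[int, int]]) -> Matrix:
    """Row-normalized adjacency with self-loops (an assumed, common choice)."""
    adj = [[0.0] * num_vertices for _ in range(num_vertices)]
    for i in range(num_vertices):
        adj[i][i] = 1.0  # self-loop keeps each vertex's own feature
    for i, j in edges:
        adj[i][j] = 1.0
        adj[j][i] = 1.0  # mesh edges are undirected
    for row in adj:
        total = sum(row)
        for k in range(num_vertices):
            row[k] /= total  # each row sums to 1, so aggregation is a local average
    return adj


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_residual_block(features: Matrix, adj: Matrix, weight: Matrix) -> Matrix:
    """Compute features + ReLU(adj @ features @ weight).

    The skip connection preserves the per-vertex features (for example, those
    produced by global self-attention), while the graph term mixes in
    information from mesh neighbours only.
    """
    aggregated = matmul(matmul(adj, features), weight)
    return [
        [f + max(0.0, a) for f, a in zip(f_row, a_row)]
        for f_row, a_row in zip(features, aggregated)
    ]
```

For a three-vertex chain, `graph_residual_block(features, normalized_adjacency(3, [(0, 1), (1, 2)]), weight)` returns one updated feature row per vertex. The `weight` matrix must be square so that the residual sum is well defined.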

Paper Structure

This paper contains 26 sections, 9 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Comparison between existing explicit contact constraints for HOI reconstruction and our implicit contact modeling.
  • Figure 2: Overview of our HOI-TG. (a) Pipeline: the HOI reconstruction process. Given the input image and the human and object segmentations, we extract the image feature and generate an initial human mesh using the ResNet50 backbone. We then prepare joint queries, vertex queries, and object queries by concatenating grid-sampled features and per-vertex 3D coordinates. Based on these queries, the HOI reconstruction transformer blocks reconstruct the human joints and vertices and the object mesh. The final HOI meshes are obtained by upsampling and rigid transformation. (b) HOI Reconstruction Transformer Block: contains a Human Graph Residual Block and an Object Graph Residual Block that encode humans and objects separately. (c) Encoder: shows how the hidden dimensions change throughout HOI-TG.
  • Figure 3: Graph adjacency of the human and a specific object. The adjacency contains connectivity and distance information. Warmer colors indicate vertices with higher centrality.
  • Figure 4: Qualitative comparison of 3D human and object reconstruction with CONTHO on BEHAVE. Our HOI-TG achieves higher accuracy regarding the relative poses between the human and the object while also reducing instances of mesh penetration.
  • Figure 5: Visualization of the attention distribution (HOI att.) between human mesh vertices and the object, with the corresponding reconstruction results (HOI recon.) from our HOI-TG. Brighter colors indicate stronger attention.
  • ...and 5 more figures
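
The Figure 2 caption states that the final meshes are obtained by upsampling and rigid transformation. The rigid-transformation step can be sketched as below, assuming the object is represented by template vertices that are posed with a rotation matrix R and a translation t. How HOI-TG predicts these parameters is not described here; `rotation_about_z` and `apply_rigid_transform` are hypothetical helpers, and the z-axis rotation is only a convenient way to build a valid rotation matrix for the demonstration.

```python
import math
from typing import List

Vec3 = List[float]


def rotation_about_z(angle_rad: float) -> List[Vec3]:
    """3x3 rotation matrix about the z-axis (demo helper, not from the paper)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_rigid_transform(
    vertices: List[Vec3], rotation: List[Vec3], translation: Vec3
) -> List[Vec3]:
    """Return R @ v + t for every vertex v of the template mesh."""
    return [
        [sum(r * x for r, x in zip(row, v)) + t for row, t in zip(rotation, translation)]
        for v in vertices
    ]


# Usage: pose a two-vertex "template" by a quarter turn and a lift along z.
template = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
posed = apply_rigid_transform(template, rotation_about_z(math.pi / 2), [0.0, 0.0, 2.0])
```

Because the transform is rigid, distances between vertices are unchanged, so the template's shape is preserved and only its pose in the scene changes.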