End-to-End HOI Reconstruction Transformer with Graph-based Encoding
Zhenrong Wang, Qi Zheng, Sihan Ma, Maosheng Ye, Yibing Zhan, Dongjiang Li
TL;DR
Reconstructing human-object interaction (HOI) from a single image is challenging because the model must balance global mesh structure against local interaction details. The authors propose HOI-TG, an end-to-end transformer augmented with graph-based encoding that models human-object interactions implicitly, combining global attention with local mesh topology. The method achieves state-of-the-art results on BEHAVE and InterCap, improving both 3D mesh reconstruction and contact quality over prior methods that rely on explicit interaction constraints and over other transformer-based approaches. The work shows that implicit interaction modeling with graph-aware fusion can capture complex HOI relationships and generalize to in-the-wild scenarios, while lying poses and symmetric objects remain open challenges.
Abstract
With the diversification of human-object interaction (HOI) applications and the success in capturing human meshes, HOI reconstruction has gained widespread attention. Existing mainstream HOI reconstruction methods often rely on explicitly modeling interactions between humans and objects. However, this approach creates a natural conflict between 3D mesh reconstruction, which emphasizes global structure, and fine-grained contact reconstruction, which focuses on local details. To address the limitations of explicit modeling, we propose the End-to-End HOI Reconstruction Transformer with Graph-based Encoding (HOI-TG). It implicitly learns the interaction between humans and objects by leveraging self-attention mechanisms. Within the transformer architecture, we devise graph residual blocks to aggregate the topology among vertices of different spatial structures. This dual focus effectively balances global and local representations. Without bells and whistles, HOI-TG achieves state-of-the-art performance on the BEHAVE and InterCap datasets. In particular, on the challenging InterCap dataset, our method improves the reconstruction results for human and object meshes by 8.9% and 8.6%, respectively.
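To make the combination of global self-attention and local graph aggregation concrete, the following is a minimal NumPy sketch, not the HOI-TG implementation. It assumes single-head attention, one graph-convolution step with a symmetric-normalized adjacency, a ReLU nonlinearity, and a toy mesh of four human vertices and two object vertices. All function names, dimensions, and the choice to place mesh edges only within the human and within the object are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention: every vertex token attends to all others,
    so human and object tokens can exchange information globally."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

def normalized_adjacency(edges, num_vertices):
    """Symmetric-normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2."""
    a = np.eye(num_vertices)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_residual_block(tokens, adj_norm, w_g):
    """One graph convolution over mesh neighbours (local topology),
    added back to the input as a residual."""
    return tokens + np.maximum(adj_norm @ tokens @ w_g, 0.0)

def encoder_layer(tokens, adj_norm, params):
    """Global attention followed by local graph aggregation (assumed ordering)."""
    h = tokens + self_attention(tokens, params["w_q"], params["w_k"], params["w_v"])
    return graph_residual_block(h, adj_norm, params["w_g"])

# Toy setup (hypothetical): vertices 0-3 form a human chain, 4-5 an object edge.
rng = np.random.default_rng(0)
dim = 8
human_edges = [(0, 1), (1, 2), (2, 3)]
object_edges = [(4, 5)]
adj_norm = normalized_adjacency(human_edges + object_edges, num_vertices=6)
params = {name: rng.normal(scale=0.1, size=(dim, dim))
          for name in ("w_q", "w_k", "w_v", "w_g")}
tokens = rng.normal(size=(6, dim))

out = encoder_layer(tokens, adj_norm, params)
print(out.shape)  # (6, 8)
```

In this sketch the adjacency matrix has no edges between human and object vertices, so any human-object coupling in the output comes only from the attention step. That mirrors, in simplified form, the idea of learning interaction implicitly rather than through explicit contact constraints.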
