Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation
Xianghui Xie, Bharat Lal Bhatnagar, Jan Eric Lenssen, Gerard Pons-Moll
TL;DR
This work tackles template-free reconstruction of 3D human–object interactions from a single RGB image. It introduces ProciGen, a scalable procedural data generator that yields over 1M synthetic interactions across thousands of object shapes, and HDM, a two-stage image-conditioned diffusion framework. The first stage jointly predicts a human–object scene; the second stage refines the human and object shapes separately using cross-attention-based diffusion models while preserving interaction context. Results on BEHAVE and InterCap show that HDM trained with ProciGen outperforms both the template-based baseline (CHORE) and the template-free baseline (PC$^2$), and that it generalizes to unseen objects and to in-the-wild images such as those in COCO. The approach offers a scalable, template-free route to 3D human–object reconstruction, with potential applications in realistic 3D avatars and robotics; code and data are released for future research.
Abstract
Reconstructing human-object interaction in 3D from a single RGB image is a challenging task, and existing data-driven methods do not generalize beyond the objects present in carefully curated 3D interaction datasets. Capturing large-scale real data to learn strong interaction and 3D shape priors is very expensive due to the combinatorial nature of human-object interactions. In this paper, we propose ProciGen (Procedural interaction Generation), a method to procedurally generate datasets with both plausible interactions and diverse object variations. We generate 1M+ human-object interaction pairs in 3D and leverage this large-scale data to train our HDM (Hierarchical Diffusion Model), a novel method to reconstruct interacting humans and unseen objects without any templates. Our HDM is an image-conditioned diffusion model that learns both realistic interaction and highly accurate human and object shapes. Experiments show that our HDM trained with ProciGen significantly outperforms prior methods that require template meshes, and that our dataset allows training methods with strong generalization ability to unseen object instances. Our code and data are released.
