Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation

Xianghui Xie, Bharat Lal Bhatnagar, Jan Eric Lenssen, Gerard Pons-Moll

TL;DR

This work tackles template-free reconstruction of 3D human–object interactions from a single RGB image by introducing ProciGen, a scalable procedural data generator that yields over 1M synthetic interactions across thousands of object shapes, and HDM, a two-stage image-conditioned diffusion framework. The first stage jointly predicts the human and object as a single scene; the second stage refines the human and object shapes separately, using diffusion models with cross-attention that preserve the interaction context. Empirical results on BEHAVE and InterCap show that HDM trained with ProciGen outperforms template-based (CHORE) and template-free (PC$^2$) baselines and generalizes to unseen objects and to in-the-wild images such as those from COCO. The approach demonstrates a scalable, template-free pathway toward realistic 3D avatars and robotics applications, with code and data released for future research.

Abstract

Reconstructing human-object interaction in 3D from a single RGB image is a challenging task, and existing data-driven methods do not generalize beyond the objects present in carefully curated 3D interaction datasets. Capturing large-scale real data to learn strong interaction and 3D shape priors is very expensive due to the combinatorial nature of human-object interactions. In this paper, we propose ProciGen (Procedural interaction Generation), a method to procedurally generate datasets with both plausible interactions and diverse object variation. We generate 1M+ human-object interaction pairs in 3D and leverage this large-scale data to train our HDM (Hierarchical Diffusion Model), a novel method to reconstruct interacting humans and unseen objects without any templates. Our HDM is an image-conditioned diffusion model that learns both realistic interaction and highly accurate human and object shapes. Experiments show that our HDM trained with ProciGen significantly outperforms prior methods that require template meshes, and that our dataset allows training methods with strong generalization ability to unseen object instances. Our code and data are released.

Paper Structure

This paper contains 30 sections, 6 equations, 20 figures, and 6 tables.

Figures (20)

  • Figure 1: Given a single RGB image, our method, trained only on our proposed synthetic interaction dataset, can reconstruct the human, object, and contacts without any predefined template meshes.
  • Figure 2: Our procedural interaction generation method. Given a seed interaction and a new object from the same category (A), we use a network to compute dense correspondences (B), which allows us to transfer contacts and initialize the new object (C). We further optimize the human and object poses to avoid interpenetration while satisfying the transferred contacts (D). We then add clothing and textures to render images, leading to a large interaction dataset with diverse object shapes (E).
  • Figure 3: Our hierarchical diffusion model. Given an RGB image of a human interacting with an object, we first jointly reconstruct the human and object as one point cloud with segmentation labels (Stage 1). This prediction reasons about the interaction but lacks accurate shapes. We then use two separate diffusion models, one for the human and one for the object, with cross attention to refine the initial noisy prediction while preserving the interaction context (Stage 2). Our hierarchical design faithfully predicts interaction and shapes.
  • Figure 4: Comparing reconstruction results on the BEHAVE [bhatnagar22behave] dataset. CHORE [xie22chore] relies on object mesh templates, and its prediction is inaccurate for challenging poses. $\text{PC}^2$ [melaskyriazi2023pc2] does not rely on templates, but its predicted point clouds are noisy (red circles) and it cannot predict contacts. Ours can reason about human-object interaction and predicts high-fidelity human and object shapes without templates.
  • Figure 5: Generalization results on the InterCap [huang2022intercap] dataset. Note that all object instances are unseen during training. CHORE [xie22chore] predicts template-specific object poses and hence cannot generalize to new object instances. $\text{PC}^2$ [melaskyriazi2023pc2] does not rely on templates, but its generalization ability is constrained by the limited shape variation in BEHAVE [bhatnagar22behave]. Training $\text{PC}^2$ on our ProciGen improves its generalization, but the predicted point clouds are still noisy. Our method is able to generalize and predicts the human and object with high shape fidelity.
  • ...and 15 more figures