PICO: Reconstructing 3D People In Contact with Objects

Alpár Cseke, Shashank Tripathi, Sai Kumar Dwivedi, Arjun Lakshmipathy, Agniv Chatterjee, Michael J. Black, Dimitrios Tzionas

TL;DR

PICO tackles 3D human–object interaction from a single image with two contributions: PICO-db, a dataset that provides dense, bijective 3D contact annotations on both humans and objects, and PICO-fit, an optimization-based fitting pipeline that uses these contacts to recover coherent 3D body and object meshes. The framework retrieves likely object shapes via OpenShape, transfers body contact patches to objects with an axis-based two-click method, and uses render-and-compare optimization to align and refine both meshes while enforcing contact and penetration constraints. Evaluations on out-of-domain in-lab datasets and in-the-wild imagery show that PICO-fit performs comparably to the state of the art, and perceptual studies indicate higher realism and generalization to object classes that prior methods did not handle. The work suggests that dense contact correspondences between bodies and objects can serve as a scalable foundation for HOI understanding in the wild, and it points to future directions in direct contact regression and vision–language model integration.

Abstract

Recovering 3D Human-Object Interaction (HOI) from single color images is challenging due to depth ambiguities, occlusions, and the huge variation in object shape and appearance. Thus, past work requires controlled settings such as known object shapes and contacts, and tackles only limited object classes. Instead, we need methods that generalize to natural images and novel object classes. We tackle this in two main ways: (1) We collect PICO-db, a new dataset of natural images uniquely paired with dense 3D contact on both body and object meshes. To this end, we use images from the recent DAMON dataset that are paired with contacts, but these contacts are only annotated on a canonical 3D body. In contrast, we seek contact labels on both the body and the object. To infer these given an image, we retrieve an appropriate 3D object mesh from a database by leveraging vision foundation models. Then, we project DAMON's body contact patches onto the object via a novel method needing only 2 clicks per patch. This minimal human input establishes rich contact correspondences between bodies and objects. (2) We exploit our new dataset of contact correspondences in a novel render-and-compare fitting method, called PICO-fit, to recover 3D body and object meshes in interaction. PICO-fit infers contact for the SMPL-X body, retrieves a likely 3D object mesh and contact from PICO-db for that object, and uses the contact to iteratively fit the 3D body and object meshes to image evidence via optimization. Uniquely, PICO-fit works well for many object categories that no existing method can tackle. This is crucial to enable HOI understanding to scale in the wild. Our data and code are available at https://pico.is.tue.mpg.de.
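To make the idea of fitting "to image evidence" while using contact more concrete, the sketch below combines an image term (one minus the IoU of a rendered and a detected mask) with a contact term (distance between corresponding body and object contact patches). It is only an illustration under stated assumptions: the function names, the weights, and the use of patch centroids are hypothetical and are not taken from the PICO-fit implementation, which also includes terms not shown here (for example, penetration).

```python
import numpy as np


def mask_iou_loss(rendered, detected, eps=1e-8):
    """Hypothetical image term: 1 - IoU of two masks with values in [0, 1]."""
    rendered = np.asarray(rendered, dtype=float)
    detected = np.asarray(detected, dtype=float)
    inter = (rendered * detected).sum()
    union = (rendered + detected - rendered * detected).sum()
    return float(1.0 - inter / (union + eps))


def contact_loss(body_patches, object_patches):
    """Hypothetical contact term: mean squared distance between the centroids
    of corresponding body and object contact patches (lists of (N_i, 3) points)."""
    if len(body_patches) == 0:
        return 0.0
    sq_dists = []
    for body_pts, obj_pts in zip(body_patches, object_patches):
        body_c = np.asarray(body_pts, dtype=float).mean(axis=0)
        obj_c = np.asarray(obj_pts, dtype=float).mean(axis=0)
        sq_dists.append(np.sum((body_c - obj_c) ** 2))
    return float(np.mean(sq_dists))


def fitting_objective(rendered, detected, body_patches, object_patches,
                      w_mask=1.0, w_contact=1.0):
    """Weighted sum of the two terms; the weights are assumptions."""
    return (w_mask * mask_iou_loss(rendered, detected)
            + w_contact * contact_loss(body_patches, object_patches))
```

In an actual render-and-compare loop, the rendered mask and the contact point positions would both depend on the pose parameters being optimized, so a differentiable renderer and an autodiff framework would take the place of plain NumPy.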

Paper Structure

This paper contains 35 sections, 19 figures, 3 tables.

Figures (19)

  • Figure 1: PICO-db dataset annotations. Left to right: Color image. Contacts (shown in various colors) annotated on the body and object. Contact annotations establish bijective body-object correspondences, denoted with color-coding.
  • Figure 2: Example contact patches with their contact axis.
  • Figure 3: Overview of PICO-fit, a novel method for fitting interacting 3D body and object meshes to an image. It initializes (\ref{sec:pico_fit_initialization}) 3D body shape and pose via OSX [lin2023osx], 3D object shape via OpenShape [liu2023openshape], and body-object contacts via retrieval from PICO-db (\ref{sec:dataset}). Then, it takes three steps: (1) It exploits contacts to solve for object pose, to register the object to the body (\ref{sec:pico_fit_stage_1}). (2) It refines object pose (\ref{sec:pico_fit_stage_2}) and (3) body pose (\ref{sec:pico_fit_stage_3}) to align these to an object and human mask, respectively, detected in the image while satisfying contacts and avoiding penetrations. For every stage we show inputs, outputs, losses, and optimizable variables. Zoom in to see details.
  • Figure 4: Qualitative comparison of PICO-fit vs PHOSA on internet images used for evaluation in the PHOSA paper [zhang2020phosa].
  • Figure 5: Qualitative evaluation of CONTHO*, HDM and PHOSA* alongside PICO-fit* on object categories handled by all baselines. From left to right: input image, pseudo-GT contact annotations in PICO-db, and 3D reconstructions (a side and top-down view per method). Reconstructions from PICO-fit* have better 3D human-object contact and spatial alignment. For more comparisons, see Sup. Mat.
  • ...and 14 more figures
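
The first step in the Figure 3 caption, using contacts to solve for object pose and register the object to the body, can be illustrated with a closed-form rigid alignment of corresponding contact points. The sketch below uses the Kabsch algorithm as a stand-in; PICO-fit itself is described as optimization-based, so this is a simplified substitute rather than the paper's procedure. The function name and the assumption that contact points are already paired one-to-one are hypothetical.

```python
import numpy as np


def kabsch_rigid_align(src, dst):
    """Least-squares rigid transform (R, t) such that R @ src_i + t ~ dst_i.

    src: (N, 3) contact points on the object, in the object's frame.
    dst: (N, 3) corresponding contact points on the posed body.
    Pairing src[i] with dst[i] is an assumption of this sketch.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


# Usage: move every object vertex into the body's frame.
# R, t = kabsch_rigid_align(object_contact_pts, body_contact_pts)
# object_vertices_posed = object_vertices @ R.T + t
```

A closed-form estimate like this could at most serve as an initialization. With few or nearly collinear contact points the rotation is poorly constrained, which is one reason a subsequent refinement against image masks, as in steps (2) and (3), is useful.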