Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models

Hyeonwoo Kim, Sookwan Han, Patrick Kwon, Hanbyul Joo

TL;DR

This work introduces Comprehensive Affordance (ComA), a representation that models both contact and non-contact human-object interaction patterns by learning distributions over relative proximity $\mathbf{p}$ and orientation $\mathbf{n}$ between object and human surfaces. It builds a scalable pipeline, Rendering-Inpainting-Uplifting, that synthesizes large-scale 3D HOI samples from 3D objects using a pre-trained 2D diffusion model with Adaptive Mask Inpainting, then lifts 2D cues to 3D via SMPL-X predictions and depth optimization. ComA defines a joint distribution $\mathcal{P}_{ij}(\mathbf{p},\mathbf{n})$ and computes pointwise affordances through $\mathbb{E}_{\mathbf{p},\mathbf{n}\sim \mathcal{P}_{ij}}[f(\mathbf{p},\mathbf{n})]$ for contact, orientation, and spatial cues, enabling robust HOI understanding beyond mere contact. Experiments on BEHAVE, InterCap, ShapeNet, and SAPIEN show ComA outperforms contact-focused baselines, preserves object context with Adaptive Mask Inpainting, and enables 3D HOI reconstruction and transfer within object categories, highlighting significant potential for scalable 3D affordance priors in robotics and interaction understanding.
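The expectation above can be illustrated with a minimal sketch. It assumes that $\mathcal{P}_{ij}$ is available only through drawn samples of $(\mathbf{p},\mathbf{n})$, that both are 3-vectors, and that a contact-style $f$ is an indicator on $\|\mathbf{p}\|$ below a threshold. The names `expected_affordance` and `contact_indicator`, the threshold `tau`, and the toy samples are all hypothetical and not taken from the paper.

```python
import math
from typing import Callable, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Sample = Tuple[Vec3, Vec3]  # (relative proximity p, relative orientation n)


def expected_affordance(samples: Sequence[Sample],
                        f: Callable[[Vec3, Vec3], float]) -> float:
    """Monte Carlo estimate of E_{p,n ~ P_ij}[f(p, n)] from drawn samples."""
    if not samples:
        raise ValueError("need at least one (p, n) sample")
    return sum(f(p, n) for p, n in samples) / len(samples)


def contact_indicator(tau: float) -> Callable[[Vec3, Vec3], float]:
    """Hypothetical choice of f: 1.0 when ||p|| is below tau, else 0.0."""
    def f(p: Vec3, n: Vec3) -> float:
        return 1.0 if math.sqrt(sum(c * c for c in p)) < tau else 0.0
    return f


# Toy samples for one (object point i, human vertex j) pair; values are made up.
toy_samples = [
    ((0.01, 0.00, 0.00), (0.0, 0.0, 1.0)),
    ((0.02, 0.01, 0.00), (0.0, 0.0, 1.0)),
    ((0.30, 0.10, 0.00), (0.0, 1.0, 0.0)),
    ((0.50, 0.00, 0.20), (1.0, 0.0, 0.0)),
]

# Fraction of samples within tau of the object point: a contact-style score.
contact_score = expected_affordance(toy_samples, contact_indicator(tau=0.05))
print(contact_score)
```

Swapping in a different `f`, for example one that scores the alignment of $\mathbf{n}$ with a reference direction, would give an orientation-style score from the same samples; the sample average stays the same and only `f` changes.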

Abstract

Understanding the inherent human knowledge in interacting with a given environment (e.g., affordance) is essential for improving AI to better assist humans. While existing approaches primarily focus on human-object contacts during interactions, such affordance representation cannot fully address other important aspects of human-object interactions (HOIs), i.e., patterns of relative positions and orientations. In this paper, we introduce a novel affordance representation, named Comprehensive Affordance (ComA). Given a 3D object mesh, ComA models the distribution of relative orientation and proximity of vertices in interacting human meshes, capturing plausible patterns of contact, relative orientations, and spatial relationships. To construct the distribution, we present a novel pipeline that synthesizes diverse and realistic 3D HOI samples given any 3D object mesh. The pipeline leverages a pre-trained 2D inpainting diffusion model to generate HOI images from object renderings and lifts them into 3D. To avoid the generation of false affordances, we propose a new inpainting framework, Adaptive Mask Inpainting. Since ComA is built on synthetic samples, it can extend to any object in an unbounded manner. Through extensive experiments, we demonstrate that ComA outperforms competitors that rely on human annotations in modeling contact-based affordance. Importantly, we also showcase the potential of ComA to reconstruct human-object interactions in 3D through an optimization framework, highlighting its advantage in incorporating both contact and non-contact properties.

Paper Structure (34 sections, 13 equations, 13 figures, 2 tables, 1 algorithm)

Figures (13)

  • Figure 1: Given a 3D object, we generate numerous 3D Human-Object Interaction (HOI) samples using text prompts, and learn a novel affordance representation called Comprehensive Affordance (ComA) which models both contact and non-contact HOI patterns.
  • Figure 2: Typically, people (1) view the screen (2) from a relatively far distance while using a laptop (Left), whereas they (1) peer into it (2) from a close distance while using a telescope (Right). A pre-trained diffusion model has knowledge of these (1) orientational and (2) spatial relations between human and object during interaction.
  • Figure 3: Method Overview. Our method can be divided into two parts: (1) Generating 3D HOI Samples and (2) Learning ComA from Generated 3D HOI Samples. In the first step, we utilize an inpainting diffusion model with our Adaptive Mask Inpainting to create 2D HOI images, and generate 3D HOI samples via an uplifting pipeline. In the second step, the generated 3D HOI samples are aggregated to create distributions for relative proximity and orientation, from which various affordance forms can be derived.
  • Figure 4: Adaptive Mask Inpainting. Without Adaptive Mask Inpainting, the original object is damaged when inserting humans, resulting in false affordances.
  • Figure 5: Qualitative Results. ComA can model distributions of contact, orientation, and spatial relation exhibited during the interaction between humans and novel objects.
  • ...and 8 more figures