Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models
Hyeonwoo Kim, Sookwan Han, Patrick Kwon, Hanbyul Joo
TL;DR
This work introduces Comprehensive Affordance (ComA), a representation that models both contact and non-contact human-object interaction patterns by learning distributions over the relative proximity $\mathbf{p}$ and orientation $\mathbf{n}$ between object and human surfaces. It builds a scalable pipeline, Rendering-Inpainting-Uplifting, that synthesizes large-scale 3D HOI samples from 3D objects: a pre-trained 2D diffusion model with Adaptive Mask Inpainting generates HOI images from object renderings, and the 2D cues are then lifted to 3D via SMPL-X predictions and depth optimization. ComA defines a joint distribution $\mathcal{P}_{ij}(\mathbf{p},\mathbf{n})$ and computes pointwise affordances as the expectation $\mathbb{E}_{\mathbf{p},\mathbf{n}\sim \mathcal{P}_{ij}}[f(\mathbf{p},\mathbf{n})]$, evaluated for contact, orientation, and spatial cues, so that HOI understanding extends beyond contact alone. Experiments on BEHAVE, InterCap, ShapeNet, and SAPIEN show that ComA outperforms contact-focused baselines, that Adaptive Mask Inpainting preserves object context, and that ComA enables 3D HOI reconstruction and transfer within object categories, suggesting its potential as a scalable 3D affordance prior for robotics and interaction understanding.
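To make the expectation $\mathbb{E}_{\mathbf{p},\mathbf{n}\sim \mathcal{P}_{ij}}[f(\mathbf{p},\mathbf{n})]$ concrete, the following is a minimal Monte Carlo sketch for a single pair of surface points, not the authors' implementation. It assumes that samples of $\mathbf{p}$ are 3D offset vectors and samples of $\mathbf{n}$ are unit vectors; the function names, the contact threshold `tau`, the two example choices of $f$, and the toy sample values are all hypothetical and chosen only for illustration.

```python
import numpy as np

def expected_affordance(p, n, f):
    """Estimate E[f(p, n)] by averaging f over samples drawn from P_ij.

    p: (S, 3) array of relative offset samples (assumed layout).
    n: (S, 3) array of relative orientation samples, unit vectors (assumed layout).
    f: callable mapping (p, n) to an (S,) array of per-sample scores.
    """
    return float(np.mean(f(p, n)))

def contact(p, n, tau=0.05):
    # Hypothetical contact cue: 1 when the offset length is below tau, else 0.
    return (np.linalg.norm(p, axis=-1) < tau).astype(float)

def upward_alignment(p, n):
    # Hypothetical orientation cue: cosine between n and the +z axis.
    return n @ np.array([0.0, 0.0, 1.0])

# Toy samples standing in for draws from P_ij (made-up values).
p = np.array([[0.01, 0.0, 0.0],
              [0.20, 0.0, 0.0],
              [0.00, 0.03, 0.0],
              [1.00, 1.00, 1.0]])
n = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, -1.0]])

contact_score = expected_affordance(p, n, contact)
align_score = expected_affordance(p, n, upward_alignment)
print(contact_score, align_score)
```

Swapping in a different $f$ changes which cue is read out of the same set of samples, which is the sense in which a single distribution can serve contact, orientation, and spatial queries.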
Abstract
Understanding the inherent human knowledge in interacting with a given environment (e.g., affordance) is essential for improving AI to better assist humans. While existing approaches primarily focus on human-object contacts during interactions, such affordance representation cannot fully address other important aspects of human-object interactions (HOIs), i.e., patterns of relative positions and orientations. In this paper, we introduce a novel affordance representation, named Comprehensive Affordance (ComA). Given a 3D object mesh, ComA models the distribution of relative orientation and proximity of vertices in interacting human meshes, capturing plausible patterns of contact, relative orientations, and spatial relationships. To construct the distribution, we present a novel pipeline that synthesizes diverse and realistic 3D HOI samples given any 3D object mesh. The pipeline leverages a pre-trained 2D inpainting diffusion model to generate HOI images from object renderings and lifts them into 3D. To avoid the generation of false affordances, we propose a new inpainting framework, Adaptive Mask Inpainting. Since ComA is built on synthetic samples, it can extend to any object in an unbounded manner. Through extensive experiments, we demonstrate that ComA outperforms competitors that rely on human annotations in modeling contact-based affordance. Importantly, we also showcase the potential of ComA to reconstruct human-object interactions in 3D through an optimization framework, highlighting its advantage in incorporating both contact and non-contact properties.
