LEMON: Learning 3D Human-Object Interaction Relation from 2D Images

Yuhang Yang, Wei Zhai, Hongchen Luo, Yang Cao, Zheng-Jun Zha

TL;DR

LEMON addresses the challenge of 3D human–object interaction understanding by jointly predicting dense 3D elements—human contact, object affordance, and spatial relation—through a unified framework that exploits interaction intention and geometric correlations between humans and objects. The method introduces interaction intention excavation via multi-branch attention, curvature-guided geometric correlation, and a contact-aware spatial relation module, all trained with a composite loss. The 3DIR dataset supplies paired HOI images, object point clouds, SMPL-H pseudo-GTs, and dense annotations to support training and evaluation. Across rigorous experiments, LEMON achieves state-of-the-art results on all targeted HOI elements and demonstrates strong generalization to multiple interactions, objects, and instances, highlighting its potential for embodied AI, robotics, and AR/VR applications.

Abstract

Learning 3D human-object interaction relation is pivotal to embodied AI and interaction modeling. Most existing methods approach the goal by learning to predict isolated interaction elements, e.g., human contact, object affordance, and human-object spatial relation, primarily from the perspective of either the human or the object. Such methods underexploit certain correlations between the interaction counterparts (human and object) and struggle to address the uncertainty in interactions. In fact, objects' functionalities potentially affect humans' interaction intentions, which reveals what the interaction is. Meanwhile, the interacting humans and objects exhibit matching geometric structures, which indicates how to interact. In light of this, we propose harnessing these inherent correlations between interaction counterparts to mitigate the uncertainty and jointly anticipate the above interaction elements in 3D space. To achieve this, we present LEMON (LEarning 3D huMan-Object iNteraction relation), a unified model that mines interaction intentions of the counterparts and employs curvatures to guide the extraction of geometric correlations, combining them to anticipate the interaction elements. In addition, the 3D Interaction Relation dataset (3DIR) is collected to serve as the test bed for training and evaluation. Extensive experiments demonstrate the superiority of LEMON over methods estimating each element in isolation.

Paper Structure

This paper contains 17 sections, 8 equations, 15 figures, and 12 tables.

Figures (15)

  • Figure 1: For an interaction image with paired geometries of the human and object, LEMON learns 3D human-object interaction relation by jointly anticipating the interaction elements, including human contact, object affordance, and human-object spatial relation. Vertices in yellow denote those in contact with the object, regions in red are object affordance regions, and the translucent sphere is the object proxy.
  • Figure 2: Motivation. Affinities within the HOI. The object affordance inherently reveals the human's interaction intention, giving rise to the intention affinity. The interacting human and object possess matching structures, exhibiting the geometry affinity.
  • Figure 3: LEMON pipeline. Modality-specific backbones first extract the respective features $\mathbf{F}_{h}, \mathbf{F}_{o}, \mathbf{F}_{i}$, which are then used to excavate the intention features ($\bar{\mathbf{T}}_{o}, \bar{\mathbf{T}}_{h}$) of the interaction (Sec. 3.2). With $\bar{\mathbf{T}}_{o}, \bar{\mathbf{T}}_{h}$ as conditions, LEMON integrates curvatures ($C_o, C_h$) to model geometric correlations and reveal the contact feature $\phi_{c}$ and the affordance feature $\phi_{a}$ (Sec. 3.3). Next, $\phi_{c}$ is injected into the calculation of the object spatial feature $\phi_{p}$ (Sec. 3.4). Finally, the decoder projects $\phi_{c}, \phi_{a}, \phi_{p}$ to the final outputs $\bar{\phi}_{c}, \bar{\phi}_{a}, \bar{\phi}_{p}$.
  • Figure 4: 3DIR Dataset. (a) The quantity of images and point clouds for each object, and a data sample containing the image, mask, dense human contact annotation, 3D object with affordance annotation, and the fitted human mesh with the object proxy sphere. (b) The proportion of our contact annotations within 24 parts on SMPL [loper2023smpl], and distributions of contact vertices for certain HOIs. (c) The ratio of annotated affordance regions to the whole object geometries, and the distribution of this ratio for some categories. (d) Mean distances (unit: m) between annotated object centers and human pelvis joints, and directional projections of annotated centers for several objects.
  • Figure 5: Visualization Results. (a) Estimated human vertices in contact with objects; the estimated contact vertices are shown in yellow. (b) Anticipated 3D object affordance; the depth of red represents the probability of the anticipated affordance. (c) Two views of the predicted spatial relation; translucent spheres are object proxies. Please zoom in for a better view.
  • ...and 10 more figures