MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions

Kaen Kogashi, Anoop Cherian, Meng-Yu Jennifer Kuo

TL;DR

This work tackles the complexity of real-world multi-human multi-object interactions by introducing MMHOI, a large-scale, annotation-rich 3D dataset, and MMHOI-Net, a ViT-based framework with a structured dual-patch object representation and action-guided reconstruction. The method jointly estimates 3D human and object geometries, actions, and interactive body parts, enforcing spatial and semantic consistency through reconstruction and interaction losses. Key contributions include the dual-patch object representation, an explicit HOI head for multi-entity action and body-part detection, and state-of-the-art results on the MMHOI and CORE4D datasets, with demonstrated generalization to unseen objects. The work advances 3D multi-HOI understanding and provides a substantial resource for developing robust, interaction-aware 3D vision systems in real-world settings.

Abstract

Real-world scenes often feature multiple humans interacting with multiple objects in ways that are causal, goal-oriented, or cooperative. Yet existing 3D human-object interaction (HOI) benchmarks consider only a fraction of these complex interactions. To close this gap, we present MMHOI -- a large-scale, Multi-human Multi-object Interaction dataset consisting of images from 12 everyday scenarios. MMHOI offers complete 3D shape and pose annotations for every person and object, along with labels for 78 action categories and 14 interaction-specific body parts, providing a comprehensive testbed for next-generation HOI research. Building on MMHOI, we present MMHOI-Net, an end-to-end transformer-based neural network for jointly estimating human-object 3D geometries, their interactions, and associated actions. A key innovation in our framework is a structured dual-patch representation for modeling objects and their interactions, combined with action recognition to enhance interaction prediction. Experiments on the MMHOI and the recently proposed CORE4D datasets demonstrate that our approach achieves state-of-the-art performance in multi-HOI modeling, excelling in both accuracy and reconstruction quality. The MMHOI dataset is publicly available at https://zenodo.org/records/17711786.

Paper Structure

This paper contains 45 sections, 5 equations, 20 figures, 5 tables.

Figures (20)

  • Figure 1: Example scenes from our MMHOI dataset -- a new, large-scale dataset with high-quality annotations of multiple 3D humans, objects, actions, and interaction body parts, enabling holistic reasoning in complex interaction scenarios. Person IDs are shown in italic, action annotations in bold, object names in regular font, and interacting body parts are highlighted in orange text.
  • Figure 2: Overview of MMHOI. The dataset is categorized into three main interaction types (dining, collaborative work, and recreational activities), which together span 12 scenarios. MMHOI consists of (a) RGB images, (b) segmentation masks, (c) 3D tracking of multiple humans and objects, and (d) action and interactive body part labels.
  • Figure 3: MMHOI-Net model architecture. Given a single RGB image, our model jointly estimates the 3D geometry of multiple humans and objects while incorporating action recognition as a supervisory signal. A ViT backbone extracts patch-level features. For human perception, detected keypoints serve as queries in a Human Perception Head [multi-hmr2024], which regresses SMPL-X pose, shape, and translation. The Object Perception Head uses a structured dual-patch representation to regress object 6DoF pose, center, and depth. An action MLP predicts action and interaction body part classes.
  • Figure 4: Our structured dual-patch object representation for inferring the object mesh parameters. The black arrows indicate approximate orientations of the objects.
  • Figure 5: Evaluation of interaction prediction on MMHOI. (a) plots the % of scenes where the predicted interaction body parts are close to the objects within a threshold (x-axis, in cm). (b) shows object-object interaction prediction comparison. Both plots evaluate multi-HOI accuracies after Procrustes alignment. The plots highlight the benefit of explicitly modeling multi-HOIs.
  • ...and 15 more figures
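
The threshold curve described for Figure 5(a) can be illustrated with a minimal sketch. It assumes that a per-scene distance (in cm) between the predicted interacting body part and the object has already been computed after Procrustes alignment; the function name `percent_within_threshold` and the sample distances are hypothetical and are not taken from the paper or its code.

```python
from typing import List, Sequence


def percent_within_threshold(
    distances_cm: Sequence[float], thresholds_cm: Sequence[float]
) -> List[float]:
    """Return, for each threshold, the % of scenes whose distance is within it.

    distances_cm: one value per scene (assumed to be precomputed after alignment).
    thresholds_cm: x-axis values of the curve, in cm.
    """
    if not distances_cm:
        raise ValueError("need at least one scene distance")
    n = len(distances_cm)
    # Count scenes at or below each threshold and convert to a percentage.
    return [100.0 * sum(d <= t for d in distances_cm) / n for t in thresholds_cm]


# Hypothetical distances for four scenes (not real results).
scene_distances_cm = [2.0, 4.5, 7.0, 12.0]
curve = percent_within_threshold(scene_distances_cm, [5.0, 10.0, 15.0])
print(curve)  # [50.0, 75.0, 100.0]
```

How the paper defines the per-scene distance (for example, which body-part vertices and object points are compared) is not stated in the caption, so that step is left outside this sketch.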