HOI-M3: Capture Multiple Humans and Objects Interaction within Contextual Environment

Juze Zhang; Jingyan Zhang; Zining Song; Zhanhe Shi; Chengfeng Zhao; Ye Shi; Jingyi Yu; Lan Xu; Jingya Wang

TL;DR

HOI-M3 addresses the scarcity of datasets for multi-human multi-object interactions by providing a large-scale, multi-view 3D motion capture dataset collected with dense RGB cameras and object-mounted IMUs. It introduces a robust capture and annotation pipeline, along with two data-driven downstream tasks: monocular capture of multiple HOI and unstructured generation of multiple HOI, each with strong baselines. The dataset comprises 181 million frames across 199 sequences, 42 viewpoints, 90 objects, and 31 human subjects, enabling rich HOI perception and generation research. By releasing data, code, and models, HOI-M3 aims to catalyze advances in understanding social interactions with surrounding objects for applications in embodied AI, robotics, and VR/AR.

Abstract

Humans naturally interact with both other people and multiple surrounding objects while engaging in various social activities. However, recent advances in modeling human-object interactions mostly focus on perceiving isolated individuals and objects, due to fundamental data scarcity. In this paper, we introduce HOI-M3, a novel large-scale dataset for modeling the interactions of Multiple huMans and Multiple objects. Notably, it provides accurate 3D tracking for both humans and objects from dense RGB and object-mounted IMU inputs, covering 199 sequences and 181M frames of diverse humans and objects under rich activities. With the unique HOI-M3 dataset, we introduce two novel data-driven tasks with strong companion baselines: monocular capture and unstructured generation of multiple human-object interactions. Extensive experiments demonstrate that our dataset is challenging and worthy of further research on multiple human-object interactions and behavior analysis. Our HOI-M3 dataset, corresponding code, and pre-trained models will be released to the community for future research.

Paper Structure (28 sections, 15 equations, 12 figures, 4 tables)

Introduction
Related Works
HOI-M$^3$ Dataset
Overview
Data Capture System
Dataset Process Pipeline
Human Motion Capture
Inertial-aid Multi-object Tracking
Downstream Tasks
Monocular Multiple HOI Capture
Multiple Interaction Generation
Experiments
Evaluation of the Multiple HOI Capturing
Evaluation of the Multiple HOI Generation
Limitations
...and 13 more sections

Figures (12)

Figure 1: We meticulously collect a dataset capturing interactions involving multiple humans and multiple objects, named HOI-M$^{3}$. This extensive dataset comprises 181 million video frames recorded from 42 diverse viewpoints, covering a wide range of daily scenarios. It is intended to facilitate various tasks related to human-object interaction perception and generation.
Figure 2: Overview of HOI-M$^3$. (a) HOI-M$^3$ across five daily scenarios (Bedroom, Dining Room, Living Room, Fitness Room, Office), (b) annotated masks corresponding to each subject (human, object), (c) tracking of multiple humans and multiple objects, (d) a large number of pre-scanned object meshes.
Figure 3: Monocular One-Stage Multiple HOI Capturing Pipeline. Given an input image, the pipeline predicts multiple maps: 1) the human-object center heatmap predicts the probability of the human's root position or object's center position, 2) the human mesh map contains the SMPL parameters and root depth, 3) the object mesh map contains the object 6D pose parameters and center depth. Through the sampling process, multiple humans and objects can be captured within a single forward process.
Figure 4: Multiple Interaction Generation Pipeline. Given the geometry of multiple objects, we employ PointNet to extract geometry features and feed them, together with the features of the preset number of humans and objects, through an MLP. The resulting features are then fed into a conditional diffusion model to generate multiple human-object interactions.
Figure 5: Qualitative comparisons of monocular multiple interaction capture on the HOI-M$^3$ dataset with two state-of-the-art monocular HOI capturing methods, PHOSA [zhang2020phosa] and CHORE [xie2022chore].
...and 7 more figures
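
The Figure 3 caption describes reading human and object parameters off dense maps at the locations where the center heatmap fires. The captions do not spell out the sampling step, so the following is only a minimal sketch of one common way to decode such a heatmap: keep cells that exceed a threshold and are the maximum of their 3x3 neighbourhood, then read the parameter vector stored at each kept cell. The function names `find_peaks` and `sample_parameters`, the threshold of 0.5, and the toy 4x4 maps are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

Grid = List[List[float]]


def find_peaks(heatmap: Grid, threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (row, col) cells that reach `threshold` and are the maximum
    of their 3x3 neighbourhood. Hypothetical stand-in for heatmap sampling."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = heatmap[r][c]
            if value < threshold:
                continue
            # Collect the cell and its neighbours, clipped at the borders.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            if value >= max(neighbours):
                peaks.append((r, c))
    return peaks


def sample_parameters(param_map, peaks):
    """Read the per-cell parameter vector stored at each detected center."""
    return [param_map[r][c] for r, c in peaks]


# Toy center heatmap with two confident centers (assumed values).
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.2],
    [0.0, 0.0, 0.2, 0.8],
]
# Toy parameter map: each cell stores a one-element vector (e.g. a depth).
param_map = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]

peaks = find_peaks(heatmap)
params = sample_parameters(param_map, peaks)
print(peaks, params)
```

In a setup like the one the caption describes, the same peak list would index both the human mesh map and the object mesh map, which is what lets all humans and objects be read out from a single forward pass. Note that this sketch keeps both cells when two neighbouring cells tie for the maximum.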
