I'M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions

Chengfeng Zhao; Juze Zhang; Jiashen Du; Ziwei Shan; Junye Wang; Jingyi Yu; Jingya Wang; Lan Xu

TL;DR

This work tackles monocular capture of 3D human–object interactions (HOI) with a minimal setup: an RGB camera plus an object-mounted IMU. It introduces a two-stage framework. The first stage, general interaction motion inference, fuses the RGB stream and the IMU signals in an end-to-end holistic tracker. The second stage, a category-aware diffusion filter, refines the initial motions under an over-parameterized representation by enforcing object-category priors and infilling hand motion. Key contributions are the I'm-HOI method, the large IMHD$^2$ dataset with ground-truth meshes and rich IMU data, and a diffusion-based refinement strategy that yields vivid, temporally coherent HOI motions at a runtime of about $0.5$ seconds per frame. The approach advances hybrid vision–inertial HOI capture toward practical motion capture for robotics, VR/AR, and embodied AI, and the dataset and code are to be released to the community. The method operates on motion sequences of length $T=64$ and conditions the diffusion denoising on the IMU observations and the first-stage results, which helps it stay robust under occlusion and fast motions such as skateboarding.
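To make the fixed sequence length concrete, the following is a minimal sketch of cutting a per-frame feature stream into windows of $T=64$ frames. Only the value $T=64$ comes from the summary above. The function name `split_into_windows`, the non-overlapping layout, and the repeat-last-frame padding are illustrative assumptions, not the authors' preprocessing.

```python
import numpy as np


def split_into_windows(frames: np.ndarray, window: int = 64) -> np.ndarray:
    """Split (N, D) per-frame features into (num_windows, window, D).

    Assumption: windows do not overlap, and a short tail is padded by
    repeating the last frame. The source only states the length T = 64.
    """
    if frames.ndim != 2 or frames.shape[0] == 0:
        raise ValueError("expected a non-empty (N, D) array")
    n, dim = frames.shape
    pad = (-n) % window  # frames needed to reach a multiple of `window`
    if pad:
        tail = np.repeat(frames[-1:], pad, axis=0)
        frames = np.concatenate([frames, tail], axis=0)
    return frames.reshape(-1, window, dim)


# Usage: 150 frames of a hypothetical 3-D feature become 3 windows of 64.
features = np.arange(150 * 3, dtype=float).reshape(150, 3)
windows = split_into_windows(features)
```

Padding by repetition keeps every window the same shape without inventing motion. An overlapping sliding window would be an equally plausible choice.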

Abstract

We are living in a world surrounded by diverse and "smart" devices with rich sensing modalities. Conveniently capturing the interactions between humans and these objects remains an open challenge. In this paper, we present I'm-HOI, a monocular scheme to faithfully capture the 3D motions of both the human and the object in a novel setting: a minimal setup of an RGB camera and an object-mounted Inertial Measurement Unit (IMU). It combines general motion inference and category-aware refinement. For the former, we introduce a holistic human-object tracking method that fuses the IMU signals with the RGB stream and progressively recovers the human motions and then the companion object motions. For the latter, we tailor a category-aware motion diffusion model, which is conditioned on both the raw IMU observations and the results from the previous stage under an over-parameterized representation. It significantly refines the initial results and generates vivid body, hand, and object motions. Moreover, we contribute a large dataset with ground-truth human and object motions, dense RGB inputs, and rich object-mounted IMU measurements. Extensive experiments demonstrate the effectiveness of I'm-HOI under a hybrid capture setting. Our dataset and code will be released to the community.
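The refinement stage described above can be illustrated with a generic conditional diffusion sampler. The sketch below uses the standard DDPM posterior with a model that predicts the clean motion. Everything named here is hypothetical: `placeholder_denoiser` stands in for a learned network, the noise schedule and step count are arbitrary, and the feature sizes in the usage lines are made up. It is not the authors' model or code.

```python
import numpy as np


def make_schedule(steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)


def placeholder_denoiser(x_t, t, cond):
    """Hypothetical stand-in for a trained network that predicts clean motion.

    A real model would use x_t, t, the raw IMU signal, and the stage-one
    estimate. This placeholder just returns the stage-one estimate.
    """
    return cond["initial_motion"]


def refine(cond, steps=50, seed=0):
    """Reverse diffusion from Gaussian noise, conditioned on `cond`."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(steps)
    x = rng.standard_normal(cond["initial_motion"].shape)
    for t in range(steps - 1, -1, -1):
        x0_hat = placeholder_denoiser(x, t, cond)
        if t == 0:
            return x0_hat  # final step: take the clean prediction
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        # Mean and variance of the DDPM posterior q(x_{t-1} | x_t, x0_hat).
        coef_x0 = np.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
        coef_xt = np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
        var = betas[t] * (1.0 - ab_prev) / (1.0 - ab_t)
        x = coef_x0 * x0_hat + coef_xt * x + np.sqrt(var) * rng.standard_normal(x.shape)
    return x


# Usage: one window of 64 frames; 9 motion and 6 IMU channels are made-up sizes.
cond = {"initial_motion": np.zeros((64, 9)), "imu": np.zeros((64, 6))}
refined = refine(cond)
```

With the placeholder in place the sampler simply returns the stage-one estimate. The refinement effect would come entirely from a trained, category-aware denoiser.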

Paper Structure (49 sections, 16 equations, 11 figures, 6 tables)

This paper contains 49 sections, 16 equations, 11 figures, 6 tables.

Introduction
Related Work
Monocular Human-centric Capture.
Inertial and Multi-modal Motion Capture.
Object-specific Interaction Prior.
Method
General Interaction Motion Inference
Preprocessing.
End-to-end Holistic Human-Object Tracking.
Robust and Lightweight Optimization.
Category-specific Interaction Diffusion Filter
Interaction Representation.
Conditional Diffusion Denoising Process.
Dataset
Capture Preparations.
...and 34 more sections

Figures (11)

Figure 1: The pipeline of I'm-HOI. Assuming video and inertial measurements input, our approach consists of a general interaction motion inference module (Sec. \ref{sec:general}) and a category-specific interaction diffusion filter (Sec. \ref{sec:specific}) to capture challenging interaction motions.
Figure 2: We exhibit selected highlights of IMHD$^2$ on the left side, and 10 well-scanned objects on the right side. In total, our dataset comprises 295 sequences and captures approximately 892k frames of data.
Figure 3: Qualitative 3D capturing results of I'm-HOI on the IMHD$^2$ dataset. Each sample includes an RGB image input, the captured motion from the camera view, and a side-view visualization.
Figure 4: Qualitative comparison results. I'm-HOI outperforms the baselines and generalizes well to new datasets.
Figure 5: Qualitative evaluation of our network architecture. The figure illustrates the effectiveness of each key design.
...and 6 more figures
