I'M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions
Chengfeng Zhao, Juze Zhang, Jiashen Du, Ziwei Shan, Junye Wang, Jingyi Yu, Jingya Wang, Lan Xu
TL;DR
This work tackles monocular capture of 3D human-object interactions (HOI) using a minimal setup: a single RGB camera plus an object-mounted IMU. It introduces a two-stage framework, I'm-HOI. The first stage is a general interaction motion inference module that fuses RGB and IMU data in an end-to-end holistic tracker. The second stage is a category-aware diffusion filter that refines the tracked motions by enforcing object-category priors and infilling hand motion, using an over-parameterized representation. The diffusion denoising operates on sequences of length $T=64$ and is conditioned on the IMU signals and the first-stage results, which keeps the method robust under occlusion and fast motions such as skateboarding. Key contributions are the I'm-HOI method, the large IMHD$^2$ dataset with ground-truth meshes and rich IMU data, and the diffusion-based refinement strategy, which yields vivid, temporally coherent HOI motions at a competitive runtime (about $0.5$ seconds per frame). The approach advances hybrid vision-inertial HOI capture toward scalable, practical motion capture for robotics, VR/AR, and embodied AI. The dataset and code are released for community use.
Abstract
We are living in a world surrounded by diverse and "smart" devices with rich sensing modalities. Conveniently capturing the interactions between humans and these objects remains out of reach. In this paper, we present I'm-HOI, a monocular scheme to faithfully capture the 3D motions of both the human and the object in a novel setting: a minimal setup of an RGB camera and an object-mounted Inertial Measurement Unit (IMU). It combines general motion inference and category-aware refinement. For the former, we introduce a holistic human-object tracking method that fuses the IMU signals and the RGB stream to progressively recover first the human motions and then the companion object motions. For the latter, we tailor a category-aware motion diffusion model, which is conditioned on both the raw IMU observations and the results from the previous stage in an over-parameterized representation. It significantly refines the initial results and generates vivid body, hand, and object motions. Moreover, we contribute a large dataset with ground-truth human and object motions, dense RGB inputs, and rich object-mounted IMU measurements. Extensive experiments demonstrate the effectiveness of I'm-HOI under a hybrid capture setting. Our dataset and code will be released to the community.
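To make the data flow into the refinement stage more concrete, the following is a minimal sketch of how per-frame first-stage results and raw IMU readings might be joined into a conditioning signal and cut into sequences of length $T=64$. Only the value $T=64$ and the two conditioning sources come from the text above. The function names `build_condition` and `split_into_windows`, the feature sizes, the per-frame concatenation, and the repeat-last-frame padding are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence

T = 64  # sequence length quoted in the TL;DR; everything else here is assumed

Frame = List[float]


def build_condition(initial_motion: Sequence[Frame], imu: Sequence[Frame]) -> List[Frame]:
    """Concatenate first-stage motion features and raw IMU readings per frame.

    Per-frame concatenation is a hypothetical choice for this sketch.
    """
    if len(initial_motion) != len(imu):
        raise ValueError("motion and IMU streams must have the same frame count")
    return [list(m) + list(s) for m, s in zip(initial_motion, imu)]


def split_into_windows(frames: Sequence[Frame], window: int = T) -> List[List[Frame]]:
    """Split a frame sequence into chunks of exactly `window` frames.

    The last chunk is padded by repeating the final frame (an assumption;
    sequence boundaries may be handled differently in practice).
    """
    if not frames:
        raise ValueError("sequence must contain at least one frame")
    pad = (-len(frames)) % window
    padded = list(frames) + [frames[-1]] * pad
    return [padded[i:i + window] for i in range(0, len(padded), window)]


# Hypothetical sizes: 150 frames, 9-D motion features, 6-D IMU (accel + gyro).
initial_motion = [[0.0] * 9 for _ in range(150)]
imu = [[1.0] * 6 for _ in range(150)]

windows = split_into_windows(build_condition(initial_motion, imu))
print(len(windows), len(windows[0]), len(windows[0][0]))  # 3 64 15
```

In this sketch each window would then be handed to a denoiser as its condition. The denoiser itself is deliberately left out, since its architecture is not described in this section.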
