Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction

Boran Wen, Ye Lu, Keyan Wan, Sirui Wang, Jiahong Zhou, Junxuan Liang, Xinpeng Liu, Bang Xiao, Dingbang Huang, Ruiyang Liu, Yong-Lu Li

TL;DR

This work tackles the challenge of scalable 4D human–object interaction reconstruction from monocular videos by introducing 4DHOISolver, a two-stage optimization that uses sparse, human-in-the-loop contact annotations to enforce temporal coherence and physical plausibility. Coupled with Open4DHOI, a large-scale dataset spanning 144 object types and 103 actions, the approach enables effective motion imitation for humanoid agents via novel contact-based rewards. The authors also provide a rigorous benchmark showing current 3D foundation models struggle to predict precise human–object contact correspondences, underscoring the value of human-in-the-loop guidance. Together, the pipeline and dataset offer a scalable path toward open-world HOI learning and control, while highlighting key open challenges in automated contact prediction.

Abstract

Generalized robots must learn from diverse, large-scale human-object interactions (HOI) to operate robustly in the real world. Monocular internet videos offer a nearly limitless and readily available source of data, capturing an unparalleled diversity of human activities, objects, and environments. However, accurately and scalably extracting 4D interaction data from these in-the-wild videos remains a significant and unsolved challenge. Thus, in this work, we introduce 4DHOISolver, a novel and efficient optimization framework that constrains the ill-posed 4D HOI reconstruction problem by leveraging sparse, human-in-the-loop contact point annotations, while maintaining high spatio-temporal coherence and physical plausibility. Leveraging this framework, we introduce Open4DHOI, a new large-scale 4D HOI dataset featuring a diverse catalog of 144 object types and 103 actions. Furthermore, we demonstrate the effectiveness of our reconstructions by enabling an RL-based agent to imitate the recovered motions. However, a comprehensive benchmark of existing 3D foundation models indicates that automatically predicting precise human-object contact correspondences remains an unsolved problem, underscoring the immediate necessity of our human-in-the-loop strategy while posing an open challenge to the community. Data and code will be publicly available at https://wenboran2002.github.io/open4dhoi/

Paper Structure

This paper contains 51 sections, 16 equations, 19 figures, 7 tables, 1 algorithm.

Figures (19)

  • Figure 1: Our work aims to reconstruct HOI motions from in-the-wild monocular video data in an efficient and scalable manner, while enabling the reconstructed data to support downstream tasks such as humanoid learning and control.
  • Figure 2: Our automated 4D reconstruction pipeline consists of three components: (a) human and object tracking, (b) 3D reconstruction, and (c) spatial alignment.
  • Figure 3: Annotation app: the first row shows the reference video, the second row displays the 3D-Human Joint annotations, and the third row presents the 3D-2D Projection annotations.
  • Figure 4: Pipeline: Our reconstruction pipeline consists of four stages. First, we perform automated reconstruction as described in Sec. \ref{sec:4d_reconstruction}. After obtaining the reconstructed results, we apply the 4DHOISolver from Sec. \ref{sec:4dhoisolver} for optimization based on the annotations. Finally, we conduct physical imitation as described in Sec. \ref{sec:hoi_simulation}.
  • Figure 5: Visualization of our HOI Imitation results.
  • ...and 14 more figures