RHOBIN Challenge: Reconstruction of Human Object Interaction

Xianghui Xie, Xi Wang, Nikos Athanasiou, Bharat Lal Bhatnagar, Chun-Hao P. Huang, Kaichun Mo, Hao Chen, Xia Jia, Zerui Zhang, Liangxian Cui, Xiao Lin, Bingqiao Qian, Jie Xiao, Wenfei Yang, Hyeongjin Nam, Daniel Sungho Jung, Kihoon Kim, Kyoung Mu Lee, Otmar Hilliges, Gerard Pons-Moll

TL;DR

RHOBIN presents the first challenge focusing on Reconstruction of Human-Object Interaction from monocular RGB imagery, organized around three tracks: 3D human reconstruction, object 6DoF pose estimation, and joint human–object reconstruction, evaluated on the BEHAVE dataset. Across tracks, winning methods leverage 2D–3D correspondences (dense NOCS maps or keypoints) and often employ a two-stage strategy for joint tasks, combining regression with optimization or pose fitting. The results show substantial progress over baselines in all tracks, with human reconstruction nearing maturity under heavy occlusion, while object pose and joint reconstruction continue to benefit from better data augmentation, model ensembles, and explicit correspondence modeling. The paper argues for future work in temporal/video settings, template-free object handling, and extending the framework to more complex scenes with multiple humans and objects, to advance robust HOI reconstruction in the wild.

Abstract

Modeling the interaction between humans and objects has been an emerging research direction in recent years. Capturing human-object interaction is, however, very challenging due to heavy occlusion and complex dynamics: it requires understanding not only 3D human pose and object pose but also the interaction between them. Reconstruction of 3D humans and reconstruction of objects have long been two separate research fields in computer vision. We hence proposed the first RHOBIN challenge, reconstruction of human-object interactions, in conjunction with the RHOBIN workshop. It aimed to bring the research communities of human reconstruction, object reconstruction, and interaction modeling together to discuss techniques and exchange ideas. Our challenge consists of three tracks of 3D reconstruction from monocular RGB images, with a focus on challenging interaction scenarios. It attracted more than 100 participants with more than 300 submissions, indicating broad interest in the research communities. This paper describes the settings of our challenge and discusses the winning methods of each track in more detail. We observe that the human reconstruction task is becoming mature even under heavy occlusion, while object pose estimation and joint reconstruction remain challenging. With the growing interest in interaction modeling, we hope this report can provide useful insights and foster future research in this direction. Our workshop website can be found at https://rhobin-challenge.github.io/.

Paper Structure (47 sections, 2 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Example images and 3D annotations from the BEHAVE dataset [bhatnagar22behave]. BEHAVE captures realistic human-object interactions in natural environments.
  • Figure 2: Framework of G2DR for object 6DoF pose estimation. Given an input image and a mask, we crop out the region of interest and predict intermediate features to directly regress the rotation and translation. Furthermore, we employ a generator-discriminator network to facilitate the generation of robust results in the presence of heavy occlusions.
  • Figure 3: Visualization of the 2D-3D correspondence defined using the Normalized Object Coordinate Space (NOCS). The object model is normalized inside a unit cube. We then render the model as a 2D image where the color is the corresponding 3D coordinate.
  • Figure 4: Object 6DoF pose estimation on the BEHAVE test set. We present three examples of 6D pose estimation results. Given an input image, the object model is projected to the correct position using the estimated pose, showing that our method performs well under heavy occlusion.
  • Figure 5: Overview of the human reconstruction method. We leverage BEV [Sun_CVPR2022_BEV] for global body center prediction and HybrIK [li2021hybrik] for accurate SMPL pose regression from predicted 3D joints. The network is trained end to end with losses highlighted in red.
  • ...and 2 more figures
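
The NOCS correspondence described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the challenge participants' implementation. It assumes the model is scaled uniformly by its longest bounding-box side and centered in the unit cube, and that each normalized coordinate is encoded as an 8-bit color with x, y, z mapped to R, G, B. The function names `normalize_to_unit_cube` and `nocs_to_rgb` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def normalize_to_unit_cube(vertices: List[Point]) -> List[Point]:
    """Uniformly scale and shift vertices so they fit inside [0, 1]^3."""
    mins = [min(v[i] for v in vertices) for i in range(3)]
    maxs = [max(v[i] for v in vertices) for i in range(3)]
    # Assumption: scale by the longest bounding-box side, which keeps
    # the aspect ratio of the object unchanged.
    scale = max(maxs[i] - mins[i] for i in range(3))
    if scale == 0:
        raise ValueError("degenerate model: all vertices coincide")
    # Center the bounding box at (0.5, 0.5, 0.5).
    centers = [(mins[i] + maxs[i]) / 2.0 for i in range(3)]
    return [
        tuple((v[i] - centers[i]) / scale + 0.5 for i in range(3))
        for v in vertices
    ]


def nocs_to_rgb(point: Point) -> Tuple[int, int, int]:
    """Encode a normalized 3D coordinate as an 8-bit RGB color."""
    # Assumption: x -> R, y -> G, z -> B, clamped to the valid range.
    return tuple(min(255, max(0, round(c * 255))) for c in point)
```

Rendering the model with these per-vertex colors yields an image in which each object pixel stores its own 3D model coordinate. A network trained to predict such an image therefore predicts 2D-3D correspondences, which a pose-fitting step can then use to recover the 6DoF pose.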