Reconstructing In-the-Wild Open-Vocabulary Human-Object Interactions

Boran Wen, Dingbang Huang, Zichen Zhang, Jiahong Zhou, Jianbin Deng, Jingyu Gong, Yulong Chen, Lizhuang Ma, Yong-Lu Li

TL;DR

This work annotates 2.5k+ 3D HOI assets from existing 2D HOI datasets to build Open3DHOI, the first open-vocabulary in-the-wild 3D HOI dataset, intended as a future test set. It also designs a novel Gaussian-HOI optimizer that efficiently reconstructs the spatial interactions between humans and objects while learning the contact regions.

Abstract

Reconstructing human-object interactions (HOI) from single images is fundamental in computer vision. Existing methods are primarily trained and tested on indoor scenes because 3D data is scarce and object variety is especially limited, which makes it challenging to generalize to real-world scenes with a wide range of objects. The limitations of previous 3D HOI datasets were primarily due to the difficulty of acquiring 3D object assets. However, with recent progress in 3D reconstruction from single images, it has become possible to reconstruct various objects from 2D HOI images. We therefore propose a pipeline for annotating fine-grained 3D humans, objects, and their interactions from single images. We annotated 2.5k+ 3D HOI assets from existing 2D HOI datasets and built the first open-vocabulary in-the-wild 3D HOI dataset, Open3DHOI, to serve as a future test set. Moreover, we design a novel Gaussian-HOI optimizer, which efficiently reconstructs the spatial interactions between humans and objects while learning the contact regions. Besides 3D HOI reconstruction, we also propose several new tasks for 3D HOI understanding to pave the way for future work. Data and code will be publicly available at https://wenboran2002.github.io/3dhoi.

Paper Structure

This paper contains 41 sections, 7 equations, 22 figures, 8 tables, 1 algorithm.

Figures (22)

  • Figure 1: We aim to reconstruct 3D HOIs from arbitrary open-world images. We propose a pipeline for annotating fine-grained reconstructions to build a dataset. Additionally, we introduce a new optimizer suitable for reconstructing arbitrary objects.
  • Figure 2: Coarse Reconstruction. We first obtain depth from the images and generate point clouds. Given masks, we extract the corresponding point clouds for the person (pink) and object (blue). We obtain a rough reconstruction by matching the mesh vertices of the person and the object with the depth point cloud.
  • Figure 3: Annotation Pipeline. (a) Filtering. Given the reconstructed human and object meshes, annotators assess their quality. If the human reconstruction is eligible, the contact area is further annotated. If the object reconstruction fails, the mask is redrawn manually and the reconstruction is performed again. (b) Given the 3D human-object interaction obtained through coarse reconstruction, we adjust the object position in Blender. For example, the rough annotation of the couch and the human body shows a mesh collision, so we move the object to make sure the person is correctly seated on the couch. (c) We use a fine annotation tool to further align the annotated human and object with the image.
  • Figure 4: Object category distribution in Open3DHOI. It encompasses a wide range of object categories.
  • Figure 5: Our pipeline. The optimizer first converts the human and object into 3D Gaussian points, then calculates a rendering loss by comparing the Gaussian-rendered image with the ground truth image. This loss is backpropagated to update the object’s pose parameters and the human’s LBS parameters. We also calculate an HOI loss, which includes collision, depth and contact losses. The red overlapping areas between the human and object in the image represent collision regions, and the dashed lines represent the ground truth depth and the depth during the optimization process. Finally, we refine the result by optimizing the contact regions.
  • ...and 17 more figures
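
The coarse reconstruction described in the Figure 2 caption (depth to point cloud, mask-based extraction, matching mesh vertices to the depth points) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the pinhole intrinsics `fx, fy, cx, cy`, the function names, and especially the matching step, which is simplified to a centroid-and-scale alignment because the caption does not say how the matching is actually performed.

```python
import numpy as np


def backproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to an (H, W, 3) point map (pinhole model, assumed)."""
    h, w = depth.shape
    # u holds pixel column indices, v holds pixel row indices.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)


def masked_points(point_map, mask):
    """Return the (N, 3) points of the pixels where the boolean mask is True."""
    return point_map[mask]


def coarse_align(vertices, target_points):
    """Hypothetical matching step: translate and uniformly scale mesh vertices so
    their centroid and mean distance to the centroid equal those of the target cloud."""
    v_center = vertices.mean(axis=0)
    t_center = target_points.mean(axis=0)
    v_spread = np.linalg.norm(vertices - v_center, axis=1).mean()
    t_spread = np.linalg.norm(target_points - t_center, axis=1).mean()
    scale = t_spread / v_spread
    return (vertices - v_center) * scale + t_center
```

In use, one would call `backproject_depth` once per image, then `masked_points` with the person mask and the object mask separately, and finally `coarse_align` on each mesh. Note that a depth point cloud covers only the visible surface, so a centroid-based fit like this one is biased toward the camera; it is meant only as a rough initialization of the kind the caption describes.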