Single-image coherent reconstruction of objects and humans

Sarthak Batra, Partha P. Chakrabarti, Simon Hadfield, Armin Mustafa

TL;DR

This work tackles monocular 3D reconstruction of scenes with multiple interacting humans and objects by introducing an optimization-based framework that enforces global spatial coherence. It jointly reasons about human–human and human–object interactions through a collision loss, an occlusion-aware 6-DOF object pose estimation via image inpainting, and a differentiable rendering-based fitting of object exemplars, all without explicit 3D supervision. A final joint optimization yields globally consistent scene layouts with reduced mesh collisions and more realistic interactions, demonstrated on COCO-2017 against state-of-the-art methods. The approach advances single-image scene understanding by producing coherent reconstructions with significantly fewer collisions in complex real-world images, though at a higher computational cost than learning-based methods.

Abstract

Existing methods for reconstructing objects and humans from a monocular image suffer from severe mesh collisions and performance limitations for interacting occluding objects. This paper introduces a method to obtain a globally consistent 3D reconstruction of interacting objects and people from a single image. Our contributions include: 1) an optimization framework, featuring a collision loss, tailored to handle human-object and human-human interactions, ensuring spatially coherent scene reconstruction; and 2) a novel technique to robustly estimate 6 degrees of freedom (DOF) poses, specifically for heavily occluded objects, exploiting image inpainting. Notably, our proposed method operates effectively on images from real-world scenarios, without necessitating scene or object-level 3D supervision. Extensive qualitative and quantitative evaluation against existing methods demonstrates a significant reduction in collisions in the final reconstructions of scenes with multiple interacting humans and objects and a more coherent scene reconstruction.
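
The abstract does not spell out the form of the collision loss, so the following is only a rough illustration of the general idea of penalizing interpenetration, not the authors' formulation. It assumes, purely for illustration, that each human or object mesh is approximated by a few proxy spheres; the function name `collision_penalty` and the sphere approximation are hypothetical and do not come from the paper.

```python
import numpy as np

def collision_penalty(centers_a, radii_a, centers_b, radii_b):
    """Sum of squared penetration depths between two sets of proxy spheres.

    Assumption: each body is represented by spheres (center, radius).
    This is a stand-in for a mesh-level collision term, not the paper's loss.
    """
    # Pairwise distances between sphere centers, shape (len(a), len(b)).
    diff = centers_a[:, None, :] - centers_b[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Positive where two spheres overlap, zero where they are apart.
    depth = np.maximum(0.0, radii_a[:, None] + radii_b[None, :] - dist)
    return float(np.sum(depth ** 2))

# Toy scene: one sphere for a person, one for an object, radius 0.25 each.
person = np.array([[0.0, 0.0, 0.0]])
obj = np.array([[0.3, 0.0, 0.0]])
radius = np.array([0.25])

# Centers are 0.3 apart but the radii sum to 0.5, so the spheres overlap.
overlapping = collision_penalty(person, radius, obj, radius)

# Shifting the object by 0.5 along x removes the overlap.
shifted = obj + np.array([0.5, 0.0, 0.0])
separated = collision_penalty(person, radius, shifted, radius)

print(overlapping, separated)
```

A term of this kind is zero when bodies do not intersect and grows with penetration depth, which is what lets an optimizer trade it off against image-alignment terms when adjusting poses.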

Paper Structure (13 sections, 9 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Comparison of the reconstruction from the proposed method (right) with PHOSA (middle). The proposed method gives a more coherent reconstruction with correct spatial arrangement by reasoning about human-human and human-object interactions.
  • Figure 2: Overview of the proposed method for generating a spatially coherent reconstruction from a single image. The steps in the red box are novel. The reconstruction before human pose optimization exhibits notable mesh collisions. After human pose optimization, mesh collisions are reduced while the relative coherence between humans is maintained.
  • Figure 3: The proposed approach gives spatially coherent reconstructions with a significant reduction in mesh collisions compared to PHOSA [zhang2020perceiving], ROMP [sun2021monocular], and BEV [sun2022putting]. Significant collisions are highlighted with circles.
  • Figure 4: Comparison of the segmentation masks and reconstructions with PHOSA. The segmentation mask of the bicycle is occluded, resulting in an erroneous reconstruction in PHOSA. The proposed method uses image inpainting to remove the occlusion and generate a better segmentation mask, which leads to a more complete reconstruction.
  • Figure 5: Qualitative comparison against PHOSA [zhang2020perceiving] on COCO 2017 test images with human-object interactions. Our method gives better spatial reconstruction while substantially reducing collisions (golden circles mark regions with notable mesh collisions; red circles mark areas where the reconstruction improves). More qualitative results are shown in Section 4.3.
  • ...and 3 more figures