Realistic Clothed Human and Object Joint Reconstruction from a Single Image

Ayushi Dutta, Marco Pesavento, Marco Volino, Adrian Hilton, Armin Mustafa

TL;DR

This work tackles joint 3D reconstruction of realistic clothed humans and objects from a single image. It introduces ReCHOR, which combines an attention-based neural implicit model with a diffusion-based occlusion inpainting module and pose-prior conditioning to produce detailed human and object surfaces, $s_h$ and $s_o$, in a shared 3D frame. The surfaces are extracted as the zero level-set of the predicted occupancies via Marching Cubes. Trained on the synthetic synHOR dataset and evaluated on BEHAVE, ReCHOR outperforms parametric and other neural-implicit baselines both quantitatively and qualitatively, with ablations highlighting the importance of global/local context fusion, the inpainting module, and the choice of input combinations. The approach advances realistic avatar creation and scene understanding for applications in AR/VR, films, and interactive media, and points to future improvements in object texture and template-free priors.
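The level-set extraction mentioned above can be illustrated with a minimal sketch. It is not ReCHOR's implementation: the two sphere fields are hypothetical stand-ins for the network's human and object outputs, the fields are assumed to be shifted so that the surface lies at zero, and only the edge-interpolation step of Marching Cubes (locating where the field changes sign along a grid edge) is shown, not the triangulation. All function names are illustrative.

```python
import math

def sphere_field(center, radius):
    """Toy implicit field (assumption): positive inside, negative outside, zero on the surface."""
    cx, cy, cz = center
    def f(x, y, z):
        return radius - math.sqrt((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2)
    return f

def surface_points(field, lo=-1.0, hi=1.0, res=24):
    """Locate zero crossings of `field` along the edges of a regular grid.

    This is the vertex-placement step of Marching Cubes: for each grid edge
    whose endpoints have opposite signs, linearly interpolate the crossing.
    """
    step = (hi - lo) / (res - 1)
    coords = [lo + i * step for i in range(res)]
    vals = [[[field(x, y, z) for z in coords] for y in coords] for x in coords]
    points = []
    for i in range(res):
        for j in range(res):
            for k in range(res):
                a = vals[i][j][k]
                p = (coords[i], coords[j], coords[k])
                # Visit the three edges leaving this grid node in +x, +y, +z.
                for di, dj, dk in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                    ni, nj, nk = i + di, j + dj, k + dk
                    if ni >= res or nj >= res or nk >= res:
                        continue
                    b = vals[ni][nj][nk]
                    if (a > 0) != (b > 0):  # sign change: the surface cuts this edge
                        t = a / (a - b)     # linear interpolation weight
                        q = (coords[ni], coords[nj], coords[nk])
                        points.append(tuple(p[d] + t * (q[d] - p[d]) for d in range(3)))
    return points

# Both toy shapes live in the same [-1, 1]^3 frame, mirroring the shared 3D frame.
human_field = sphere_field((-0.3, 0.0, 0.0), 0.4)
object_field = sphere_field((0.4, 0.0, 0.0), 0.25)
human_pts = surface_points(human_field)
object_pts = surface_points(object_field)
```

Because both fields are sampled on the same grid, the recovered points need no further alignment. A full pipeline would pass the sampled volume to a complete Marching Cubes implementation to obtain a triangle mesh.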

Abstract

Recent approaches to jointly reconstructing 3D humans and objects from a single RGB image represent 3D shapes with template-based or coarse models, which fail to capture details of loose clothing on human bodies. In this paper, we introduce a novel implicit approach for jointly reconstructing realistic 3D clothed humans and objects from a monocular view. For the first time, we model both the human and the object with an implicit representation, which allows the model to capture more realistic details such as clothing. This task is extremely challenging due to human-object occlusions and the lack of 3D information in 2D images, which often lead to poor detail reconstruction and depth ambiguity. To address these problems, we propose a novel attention-based neural implicit model that leverages pixel-aligned image features both from the input human-object image, for a global understanding of the human-object scene, and from separate local views of the human and the object, to improve realism in details such as clothing. Additionally, the network is conditioned on semantic features derived from an estimated human-object pose prior, which provides 3D spatial information about the shared space of humans and objects. To handle human occlusion caused by objects, we use a generative diffusion model that inpaints the occluded regions, recovering otherwise lost details. For training and evaluation, we introduce a synthetic dataset featuring rendered scenes of inter-occluded 3D human scans and diverse objects. Extensive evaluation on both synthetic and real-world datasets demonstrates the superior quality of the proposed human-object reconstructions over competitive methods.

Paper Structure

This paper contains 20 sections, 8 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: ReCHOR jointly reconstructs realistic clothed humans and objects from synthetic (a) and real (b) images by first handling human occlusion with a conditioned generative model followed by attention-based neural implicit model estimation.
  • Figure 2: ReCHOR overview: Given an input image of a human-object scene, we first use a generative model to inpaint occluded human body regions, guided by a mask of missing areas and the segmented input of the human. Next, the generated image, along with an estimated normal map, the input image, the segmented object image, and estimated pose parameters, are processed by an attention-based neural implicit model. This model jointly estimates the implicit representation of the human-object shape.
  • Figure 3: The attention-based neural implicit model first extracts pixel-aligned features from the input images to capture local details. It then uses a transformer encoder to merge these features, learning global and local contextual information about the scene. Finally, the model estimates the implicit representations for both humans and objects. Human-object pose priors provide 3D spatial information to address depth ambiguity.
  • Figure 4: Qualitative comparison against methods that jointly reconstruct humans and objects, with examples from the BEHAVE dataset (Bhatnagar et al., 2022). Note that HDM generates point clouds rather than meshes. Front and side views are shown.
  • Figure 5: Visual comparisons on the synHOR dataset with approaches that reconstruct 3D humans, as well as with baselines designed for fair comparison. Front and side views are shown.
  • ...and 11 more figures
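
The pixel-aligned feature extraction described in the Figure 3 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's code: the camera is assumed orthographic along the z axis, points are assumed normalised to [-1, 1], feature maps are plain nested lists of shape H x W x C, and the function names `bilinear_sample` and `pixel_aligned_feature` are hypothetical. The transformer encoder that fuses the features in ReCHOR is not shown; here the per-image features are simply concatenated.

```python
def bilinear_sample(feature_map, u, v):
    """Sample an H x W x C feature map (nested lists) at continuous pixel coords (u, v)."""
    h, w = len(feature_map), len(feature_map[0])
    # Clamp to the image so points projecting outside reuse border features.
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = u - x0, v - y0
    out = []
    for c in range(len(feature_map[0][0])):
        top = (1 - ax) * feature_map[y0][x0][c] + ax * feature_map[y0][x1][c]
        bot = (1 - ax) * feature_map[y1][x0][c] + ax * feature_map[y1][x1][c]
        out.append((1 - ay) * top + ay * bot)
    return out

def pixel_aligned_feature(point, feature_maps):
    """Build a per-point feature from several image feature maps.

    Assumption: orthographic projection, so (x, y) index the image and z is
    appended as a depth value for the implicit function to consume.
    """
    x, y, z = point
    feats = []
    for fmap in feature_maps:
        h, w = len(fmap), len(fmap[0])
        u = (x + 1.0) * 0.5 * (w - 1)   # map x in [-1, 1] to a column index
        v = (1.0 - y) * 0.5 * (h - 1)   # image rows grow downwards
        feats.extend(bilinear_sample(fmap, u, v))
    feats.append(z)
    return feats

# Toy inputs: a 2 x 2 "global" map and a 1 x 1 "local" map, one channel each.
global_map = [[[0.0], [1.0]], [[2.0], [3.0]]]
local_map = [[[7.0]]]
feature = pixel_aligned_feature((0.0, 0.0, 0.5), [global_map, local_map])
```

Sampling every map at the same projected location is what makes the features pixel-aligned: each 3D query point reads the image evidence directly behind it, from both the full scene image and the separate human and object views.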