Realistic Clothed Human and Object Joint Reconstruction from a Single Image
Ayushi Dutta, Marco Pesavento, Marco Volino, Adrian Hilton, Armin Mustafa
TL;DR
This work tackles joint 3D reconstruction of realistic clothed humans and objects from a single image. It introduces ReCHOR, which combines an attention-based neural implicit model with a diffusion-based occlusion inpainting module and pose-prior conditioning to produce detailed human and object surfaces ($s_h$ and $s_o$) in a shared 3D frame; the surfaces are extracted as the zero level-set of the predicted occupancies via Marching Cubes. Trained on the synthetic synHOR dataset and evaluated on BEHAVE, ReCHOR outperforms parametric and other neural-implicit baselines both quantitatively and qualitatively, with ablations highlighting the importance of global/local context fusion, the inpainting module, and the choice of input combinations. The approach advances realistic avatar creation and scene understanding for applications in AR/VR, film, and interactive media, and points to future improvements in object texture and template-free priors.
Abstract
Recent approaches to jointly reconstruct 3D humans and objects from a single RGB image represent 3D shapes with template-based or coarse models, which fail to capture details of loose clothing on human bodies. In this paper, we introduce a novel implicit approach for jointly reconstructing realistic 3D clothed humans and objects from a monocular view. For the first time, we model both the human and the object with an implicit representation, which allows us to capture more realistic details such as clothing. This task is extremely challenging due to human-object occlusions and the lack of 3D information in 2D images, which often lead to poor detail reconstruction and depth ambiguity. To address these problems, we propose a novel attention-based neural implicit model that leverages image pixel alignment both from the input human-object image, for a global understanding of the human-object scene, and from separate local views of the human and the object, to improve realism in, for example, clothing details. Additionally, the network is conditioned on semantic features derived from an estimated human-object pose prior, which provides 3D spatial information about the space shared by the human and the object. To handle occlusion of the human by objects, we use a generative diffusion model that inpaints the occluded regions, recovering details that would otherwise be lost. For training and evaluation, we introduce a synthetic dataset featuring rendered scenes of inter-occluded 3D human scans and diverse objects. Extensive evaluation on both synthetic and real-world datasets demonstrates the superior quality of the proposed human-object reconstructions over competitive methods.
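
To make the global/local fusion described in the abstract more concrete, the following is a minimal NumPy sketch of pixel-aligned feature sampling followed by attention over a global and a local feature. It is an illustration under stated assumptions, not the ReCHOR architecture: the orthographic projection, the shared normalized coordinates for both views, the single-query attention, the linear occupancy head, the random weights, and all function and parameter names (`sample_pixel_aligned`, `attention_fuse`, `predict_occupancy`, `w_q`, `w_k`, `w_out`) are hypothetical.

```python
import numpy as np

def sample_pixel_aligned(feat, xy):
    """Bilinearly sample a (C, H, W) feature map at normalized xy in [-1, 1]."""
    C, H, W = feat.shape
    # Map normalized coordinates to continuous pixel indices.
    x = np.clip((xy[:, 0] + 1.0) * 0.5 * (W - 1), 0, W - 1)
    y = np.clip((xy[:, 1] + 1.0) * 0.5 * (H - 1), 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bot = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return (top * (1 - wy) + bot * wy).T  # (N, C)

def attention_fuse(global_f, local_f, w_q, w_k):
    """Fuse two (N, C) feature sets; the global feature acts as the query (assumption)."""
    tokens = np.stack([global_f, local_f], axis=1)       # (N, 2, C)
    q = global_f @ w_q                                    # (N, D)
    k = tokens @ w_k                                      # (N, 2, D)
    scores = np.einsum("nd,ntd->nt", q, k) / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    fused = np.einsum("nt,ntc->nc", weights, tokens)      # (N, C)
    return fused, weights

def predict_occupancy(points, global_map, local_map, params):
    """Predict an occupancy value in (0, 1) for each 3D query point."""
    xy = points[:, :2]   # orthographic projection (assumption): drop depth
    z = points[:, 2:3]   # depth is passed to the head as an extra input
    g = sample_pixel_aligned(global_map, xy)
    l = sample_pixel_aligned(local_map, xy)
    fused, weights = attention_fuse(g, l, params["w_q"], params["w_k"])
    logits = np.concatenate([fused, z], axis=1) @ params["w_out"]
    return 1.0 / (1.0 + np.exp(-logits[:, 0])), weights

# Toy usage with random feature maps and weights (placeholders for learned ones).
rng = np.random.default_rng(0)
C, D = 8, 4
params = {
    "w_q": rng.normal(size=(C, D)),
    "w_k": rng.normal(size=(C, D)),
    "w_out": rng.normal(size=(C + 1, 1)),
}
global_map = rng.normal(size=(C, 16, 16))  # features of the full human-object image
local_map = rng.normal(size=(C, 32, 32))   # features of a separate human (or object) view
points = rng.uniform(-1, 1, size=(5, 3))
occ, weights = predict_occupancy(points, global_map, local_map, params)
```

In a trained model the feature maps would come from image encoders and the weights would be learned; a surface could then be extracted from a dense grid of such occupancy predictions with Marching Cubes, as the TL;DR notes.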
