Template-Free Single-View 3D Human Digitalization with Diffusion-Guided LRM

Zhenzhen Weng, Jingyuan Liu, Hao Tan, Zhan Xu, Yang Zhou, Serena Yeung-Levy, Jimei Yang

TL;DR

Human-LRM is a diffusion-guided feed-forward model that predicts the implicit field of a human from a single image. It captures humans without any template prior (e.g., SMPL) and enhances occluded parts with rich and realistic details.

Abstract

Reconstructing 3D humans from a single image has been extensively investigated. However, existing approaches often fall short in capturing fine geometry and appearance details, hallucinating occluded parts with plausible details, and generalizing to unseen and in-the-wild datasets. We present Human-LRM, a diffusion-guided feed-forward model that predicts the implicit field of a human from a single image. Leveraging a state-of-the-art reconstruction model (i.e., LRM) and generative model (i.e., Stable Diffusion), our method captures humans without any template prior, e.g., SMPL, and effectively enhances occluded parts with rich and realistic details. Our approach first uses a single-view LRM model with an enhanced geometry decoder to obtain a triplane NeRF representation. The novel-view renderings from the triplane NeRF provide a strong geometry and color prior, from which we generate photo-realistic details for the occluded parts using a diffusion model. The generated multiple views then enable reconstruction with high-quality geometry and appearance, leading to superior overall performance compared to all existing human reconstruction methods.
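
The triplane representation mentioned in the abstract can be illustrated with a minimal NumPy sketch of how a 3D point might be turned into a feature vector and decoded into an SDF value and a color. This is an illustration under stated assumptions, not the authors' implementation: the function names (`bilinear_sample`, `query_triplane`, `decode_sdf_rgb`), the choice to concatenate the three plane features, the single-hidden-layer decoder, and the [-1, 1] coordinate convention are all hypothetical.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample an (H, W, C) feature plane at coords u, v in [-1, 1]."""
    H, W, _ = plane.shape
    # Map normalized coordinates to continuous pixel coordinates.
    x = (u + 1.0) * 0.5 * (W - 1)
    y = (v + 1.0) * 0.5 * (H - 1)
    # Index of the top-left neighbour, clamped so that x0 + 1 and y0 + 1 stay in range.
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx = np.clip(x - x0, 0.0, 1.0)[:, None]
    wy = np.clip(y - y0, 0.0, 1.0)[:, None]
    top = (1 - wx) * plane[y0, x0] + wx * plane[y0, x0 + 1]
    bot = (1 - wx) * plane[y0 + 1, x0] + wx * plane[y0 + 1, x0 + 1]
    return (1 - wy) * top + wy * bot

def query_triplane(planes, points):
    """Project (N, 3) points onto the XY, XZ and YZ planes and gather features.

    Assumption: the three per-plane features are concatenated; summing or
    averaging them is an equally common alternative.
    """
    xy, xz, yz = planes
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    feats = [bilinear_sample(xy, x, y),
             bilinear_sample(xz, x, z),
             bilinear_sample(yz, y, z)]
    return np.concatenate(feats, axis=-1)

def decode_sdf_rgb(feats, w_hidden, w_sdf, w_rgb):
    """Hypothetical decoder: one shared ReLU layer, then SDF and RGB heads."""
    h = np.maximum(feats @ w_hidden, 0.0)
    sdf = (h @ w_sdf)[:, 0]                    # signed distance per point
    rgb = 1.0 / (1.0 + np.exp(-(h @ w_rgb)))   # sigmoid keeps colors in (0, 1)
    return sdf, rgb
```

In this sketch the transformer's only job would be to produce the three feature planes; every 3D query afterwards is a cheap lookup plus a small decoder, which is what makes rendering novel views from a single forward pass practical.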

Paper Structure (22 sections, 3 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: We present Human-LRM, a template-free large reconstruction model for feed-forward 3D human digitalization from a single image. Trained on a vast dataset comprising multi-view captures and 3D scans, our model generalizes across a broader range of scenarios. Guided by dense novel views generated by a conditional diffusion model, our model can generate high-fidelity full-body humans from a single image. Our project webpage is at https://zzweng.github.io/humanlrm.
  • Figure 2: Comparison of Human-LRM with SoTA single-view human reconstruction methods on in-the-wild images. Compared to volumetric reconstruction methods, our method achieves superior generalizability to challenging poses (a) and higher fidelity appearance prediction (b). Compared to generalizable human NeRF methods (c), our result achieves much better geometry quality.
  • Figure 3: Overview of Human-LRM. Given a single image, we encode the image using ViT [caron2021emerging] and employ a transformer to decode a triplane representation [chan2022efficient], followed by SDF and RGB MLPs for volumetric rendering of RGB and depth from novel viewpoints. Next, we use a conditional diffusion model to generate novel views of the person, conditioned on the coarse geometry renderings. From the dense views generated by the diffusion model, we then use a multi-view reconstruction model to reconstruct the person with fine geometry and textures.
  • Figure 4: Geometry and appearance comparison with PIFu [saito2019pifu], GTA [zhang2024global], and SIFU [zhang2023sifu] on in-the-wild images.
  • Figure 5: Comparison of our single-view reconstruction model to previous volumetric reconstruction methods: PIFu [saito2019pifu], PIFu-HD [saito2020pifuhd], ECON [xiu2023econ], LRM [hong2023lrm], GTA [zhang2024global], and SIFU [zhang2023sifu]. All models are trained on THuman 2.0. For each example we show the geometry (colored by vertex normals) from 4 views.
  • ...and 6 more figures
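
The Figure 3 caption mentions volumetric rendering of RGB and depth from SDF and RGB predictions. A rough sketch of what that step could look like for a single ray is given below. The source does not say how SDF values are converted to densities, so the Laplace-CDF mapping used here (in the style of VolSDF) and the `beta` value are assumptions made purely for illustration; the compositing itself is the standard NeRF quadrature. The names `sdf_to_density` and `render_ray` are hypothetical.

```python
import numpy as np

def sdf_to_density(sdf, beta=0.05):
    """Assumed mapping: density = (1 / beta) * LaplaceCDF(-sdf; scale=beta).

    Density is near zero outside the surface (sdf > 0) and approaches
    1 / beta inside it (sdf < 0).
    """
    s = -sdf
    e = 0.5 * np.exp(-np.abs(s) / beta)   # numerically safe for either sign
    cdf = np.where(s <= 0, e, 1.0 - e)
    return cdf / beta

def render_ray(sdf, rgb, t, beta=0.05):
    """Composite color and expected depth along one ray.

    sdf: (N,) signed distances at the samples
    rgb: (N, 3) colors at the samples
    t:   (N,) increasing sample depths along the ray
    """
    sigma = sdf_to_density(sdf, beta)
    # Distance between consecutive samples; the last interval is treated as unbounded.
    delta = np.append(np.diff(t), 1e10)
    alpha = 1.0 - np.exp(-sigma * delta)
    # Transmittance: probability that the ray reaches each sample unoccluded.
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1]))
    weights = alpha * trans
    color = weights @ rgb
    depth = weights @ t
    return color, depth, weights
```

For a ray that crosses a surface, the weights concentrate around the zero crossing of the SDF, so the expected depth lands close to the surface and the composited color is dominated by samples near it. Rendered depth and color maps of this kind are what a downstream diffusion model could be conditioned on.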