DiHuR: Diffusion-Guided Generalizable Human Reconstruction
Jinnan Chen, Chen Li, Gim Hee Lee
TL;DR
DiHuR tackles generalizable 3D human reconstruction and novel view synthesis from sparse multi-view images without 3D supervision. It introduces learnable tokens attached to SMPL vertices to provide a cross-view, identity-agnostic prior for SDF-based geometry, and couples this with a diffusion-based 2D prior via Score Distillation Sampling to enhance clothing detail and fine geometry. A KNN-based fusion of SMPL features, a geometry code for appearance, and a multi-target optimization scheme further regularize the reconstruction under sparse overlap. Evaluations on THuman, ZJU-MoCap, and HuMMan show state-of-the-art performance in both 3D reconstruction and NVS across within-dataset and cross-dataset settings, demonstrating strong generalization to unseen identities and poses.
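For readers unfamiliar with Score Distillation Sampling, its gradient is commonly written in the generic form below, as introduced for text-to-3D generation. This is background only, not necessarily the exact loss DiHuR uses. The symbols are assumptions introduced here for illustration: $\theta$ denotes the parameters of the 3D representation, $x = g(\theta)$ a rendered image, $x_t = \alpha_t x + \sigma_t \epsilon$ its noised version at timestep $t$, $\hat{\epsilon}_\phi$ the frozen diffusion model's noise prediction under conditioning $y$, and $w(t)$ a timestep weighting.

```latex
\nabla_{\theta} \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = g(\theta), \quad x_t = \alpha_t x + \sigma_t \epsilon .
```

The difference between predicted and injected noise acts as a per-pixel correction that is back-propagated through the renderer only, which is why the 2D diffusion model can stay frozen.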
Abstract
We introduce DiHuR, a novel Diffusion-guided model for generalizable Human 3D Reconstruction and view synthesis from sparse, minimally overlapping images. While existing generalizable human radiance fields excel at novel view synthesis, they often struggle with comprehensive 3D reconstruction. Similarly, directly optimizing implicit Signed Distance Function (SDF) fields from sparse-view images typically yields poor results due to limited overlap. To enhance 3D reconstruction quality, we propose using learnable tokens associated with SMPL vertices to aggregate sparse-view features and then guide SDF prediction. These tokens learn a generalizable prior across the identities in the training datasets, because SMPL vertices project consistently onto similar semantic areas across different human identities. This consistency enables effective knowledge transfer to unseen identities during inference. Recognizing SMPL's limitations in capturing clothing details, we incorporate a diffusion model as an additional prior to fill in missing information, particularly for complex clothing geometries. Our method integrates two key priors in a coherent manner: the prior from generalizable feed-forward models and the 2D diffusion prior. It requires only multi-view images for training, without 3D supervision. DiHuR outperforms existing methods in both within-dataset and cross-dataset generalization settings on the THuman, ZJU-MoCap, and HuMMan datasets.
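As a rough illustration of how features stored on SMPL vertices could condition an SDF query at an arbitrary 3D point, the sketch below fuses the features of the K nearest vertices with inverse-distance weights. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `knn_fuse`, the inverse-distance weighting, the default `k=4`, and the random stand-in mesh are all hypothetical choices made for this example.

```python
import numpy as np

def knn_fuse(query_points, vertices, vertex_features, k=4, eps=1e-8):
    """Fuse per-vertex features at each query point (hypothetical helper).

    query_points:    (Q, 3) 3D sample locations
    vertices:        (V, 3) body-model vertex positions
    vertex_features: (V, C) one feature vector (token) per vertex
    Returns fused features (Q, C), neighbour indices (Q, k), weights (Q, k).
    """
    # Euclidean distance from every query point to every vertex: (Q, V)
    diff = query_points[:, None, :] - vertices[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Indices of the k closest vertices for each query (unordered within the k)
    idx = np.argpartition(dist, k - 1, axis=1)[:, :k]
    nearest = np.take_along_axis(dist, idx, axis=1)

    # Assumed weighting: inverse distance, normalised to sum to one per query
    weights = 1.0 / (nearest + eps)
    weights /= weights.sum(axis=1, keepdims=True)

    # Weighted average of the neighbours' features: (Q, C)
    fused = (weights[..., None] * vertex_features[idx]).sum(axis=1)
    return fused, idx, weights


# Toy usage with a random stand-in for a body mesh
rng = np.random.default_rng(0)
verts = rng.normal(size=(50, 3))
feats = rng.normal(size=(50, 8))
queries = rng.normal(size=(5, 3))
fused, idx, w = knn_fuse(queries, verts, feats, k=4)
print(fused.shape)
```

In a full pipeline the fused feature would be passed, together with the query point, to a network that predicts the signed distance. The dense distance matrix used here is only practical for small inputs, and a spatial index would be the usual replacement at scale.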
