DiHuR: Diffusion-Guided Generalizable Human Reconstruction

Jinnan Chen, Chen Li, Gim Hee Lee

TL;DR

DiHuR tackles generalizable 3D human reconstruction and novel view synthesis from sparse multi-view images without 3D supervision. It introduces learnable tokens attached to SMPL vertices to provide a cross-view, identity-agnostic prior for SDF-based geometry, and couples this with a diffusion-based 2D prior via Score Distillation Sampling to enhance clothing detail and fine geometry. A KNN-based fusion of SMPL features, a geometry code for appearance, and a multi-target optimization scheme further regularize the reconstruction under sparse overlap. Evaluations on THuman, ZJU-MoCap, and HuMMan show state-of-the-art performance in both 3D reconstruction and NVS across within-dataset and cross-dataset settings, demonstrating strong generalization to unseen identities and poses.
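As a rough illustration of the KNN-based fusion mentioned above, the sketch below interpolates per-vertex features to arbitrary 3D query points. It is a minimal NumPy sketch, not the authors' implementation: the function name `knn_interpolate`, the inverse-distance weighting, and the default `k=4` are assumptions chosen for illustration, and the source does not specify which weighting DiHuR uses.

```python
import numpy as np

def knn_interpolate(query, vertices, vertex_feats, k=4, eps=1e-8):
    """Inverse-distance-weighted average of the k nearest vertex features.

    query:        (Q, 3) query points
    vertices:     (V, 3) vertex positions (e.g. SMPL vertices)
    vertex_feats: (V, C) one feature vector per vertex
    returns:      (Q, C) interpolated features
    The weighting scheme is an assumption made for this sketch.
    """
    # Pairwise squared distances between queries and vertices, shape (Q, V).
    d2 = ((query[:, None, :] - vertices[None, :, :]) ** 2).sum(-1)
    # Indices of the k nearest vertices for each query, shape (Q, k).
    idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
    d = np.sqrt(np.take_along_axis(d2, idx, axis=1))
    # Closer vertices get larger weights; weights sum to 1 per query.
    w = 1.0 / (d + eps)
    w /= w.sum(axis=1, keepdims=True)
    return (w[..., None] * vertex_feats[idx]).sum(axis=1)

# Toy usage: four vertices with 2-D features, two query points.
verts = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
feats = np.array([[1., 0.], [0., 1.], [2., 2.], [3., -1.]])
queries = np.array([[0.1, 0.1, 0.0], [0.0, 0.9, 0.1]])
fused = knn_interpolate(queries, verts, feats, k=3)  # shape (2, 2)
```

A brute-force distance matrix is adequate for a toy example; with thousands of vertices and many query points, a spatial index such as a k-d tree would be the more practical choice.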

Abstract

We introduce DiHuR, a novel Diffusion-guided model for generalizable Human 3D Reconstruction and view synthesis from sparse, minimally overlapping images. While existing generalizable human radiance fields excel at novel view synthesis, they often struggle with comprehensive 3D reconstruction. Similarly, directly optimizing implicit Signed Distance Function (SDF) fields from sparse-view images typically yields poor results due to limited overlap. To enhance 3D reconstruction quality, we propose using learnable tokens associated with SMPL vertices to aggregate sparse view features and then to guide SDF prediction. These tokens learn a generalizable prior across different identities in training datasets, leveraging the consistent projection of SMPL vertices onto similar semantic areas across various human identities. This consistency enables effective knowledge transfer to unseen identities during inference. Recognizing SMPL's limitations in capturing clothing details, we incorporate a diffusion model as an additional prior to fill in missing information, particularly for complex clothing geometries. Our method integrates two key priors in a coherent manner: the prior from generalizable feed-forward models and the 2D diffusion prior, and it requires only multi-view image training, without 3D supervision. DiHuR demonstrates superior performance in both within-dataset and cross-dataset generalization settings, as validated on THuman, ZJU-MoCap, and HuMMan datasets compared to existing methods.
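The diffusion prior is applied through Score Distillation Sampling (SDS). The sketch below shows the generic SDS gradient with respect to a rendered image, as commonly formulated: noise the render, ask a denoiser to predict the noise, and use the weighted residual as the gradient. It is an illustration under stated assumptions, not DiHuR's exact loss: `predict_noise` is a hypothetical stand-in for a pre-trained diffusion model, the weighting `1 - alpha_bar_t` is one common choice, and the backpropagation through the renderer is omitted.

```python
import numpy as np

def sds_gradient(x, predict_noise, alpha_bar_t, rng):
    """Generic SDS gradient w.r.t. a rendered image x (sketch).

    predict_noise(x_t, alpha_bar_t) is a hypothetical denoiser interface.
    """
    eps = rng.standard_normal(x.shape)
    # Forward diffusion: mix the render with Gaussian noise.
    x_t = np.sqrt(alpha_bar_t) * x + np.sqrt(1.0 - alpha_bar_t) * eps
    eps_hat = predict_noise(x_t, alpha_bar_t)
    w = 1.0 - alpha_bar_t  # assumed weighting w(t)
    # The denoiser Jacobian is skipped, as is standard for SDS.
    return w * (eps_hat - eps)

# Toy denoiser: exact noise predictor when the data distribution is a
# single point `target` (an assumption that makes the example checkable).
target = np.zeros(3)

def toy_predict_noise(x_t, alpha_bar_t):
    return (x_t - np.sqrt(alpha_bar_t) * target) / np.sqrt(1.0 - alpha_bar_t)

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 3.0])
g = sds_gradient(x, toy_predict_noise, alpha_bar_t=0.5, rng=rng)
x_updated = x - 0.5 * g  # one gradient step pulls x toward `target`
```

In a full pipeline the gradient would be chained through the differentiable renderer to update the model parameters rather than the image itself.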

Paper Structure

This paper contains 28 sections, 15 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Given three input views with minimally overlapping fields of view, our method can accurately reconstruct 3D human models.
  • Figure 2: Overview of our proposed DiHuR. Our pipeline consists of three main parts: 1) feature aggregation, 2) SDF prediction, and 3) appearance prediction. Our learnable tokens serve as query tokens to aggregate sparse-view features, which are then interpolated by KNN for each query point. Volume rendering uses the aggregated features together with the mean and variance of the features projected from the sparse input views. During inference, we finetune our model with a normal SDS loss to enhance the details.
  • Figure 3: We show how to compute the SDS loss on the pre-trained super-resolution diffusion model: the low-resolution conditioning image on the right is generated from our fixed model.
  • Figure 4: From left to right are reconstruction results from NeuS [neus21], PIFuHD [pifuhd20], SparseNeuS [sparseneus22], GP-NeRF [gpnerf22], SiTH [sith24], SIFU [sifu24], Ours, and ground truth. For single-view reconstruction methods [pifuhd20, sith24, sifu24], we choose the front view as input.
  • Figure 5: Visual comparison on the THuman dataset. From left to right are results from SparseNeuS [sparseneus22], MPS-NeRF [mps22], GP-NeRF [gpnerf22], SiTH [sith24], SIFU [sifu24], Ours, and ground truth.
  • ...and 3 more figures
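The Figure 2 caption mentions the mean and variance of features projected from the sparse input views. The sketch below shows one way such cross-view statistics can be computed for a set of 3D query points. It assumes a pinhole camera model and nearest-pixel sampling, and the helper names (`project`, `sample_nearest`, `cross_view_stats`) are hypothetical; the source does not describe the sampling scheme DiHuR uses.

```python
import numpy as np

def project(points, K, R, t):
    """Pinhole projection of (Q, 3) world points to (Q, 2) pixel coords."""
    cam = points @ R.T + t           # world frame -> camera frame
    uv = cam @ K.T                   # camera frame -> homogeneous pixels
    return uv[:, :2] / uv[:, 2:3]    # perspective divide

def sample_nearest(feat_map, uv):
    """Nearest-pixel lookup in an (H, W, C) feature map; uv holds (x, y)."""
    h, w, _ = feat_map.shape
    x = np.clip(np.rint(uv[:, 0]).astype(int), 0, w - 1)
    y = np.clip(np.rint(uv[:, 1]).astype(int), 0, h - 1)
    return feat_map[y, x]

def cross_view_stats(points, feat_maps, cameras):
    """Per-point mean and variance of the features sampled in each view."""
    per_view = np.stack([
        sample_nearest(f, project(points, K, R, t))
        for f, (K, R, t) in zip(feat_maps, cameras)
    ])                               # shape (num_views, Q, C)
    return per_view.mean(axis=0), per_view.var(axis=0)

# Toy usage: two identical cameras looking at one point, with feature
# maps that disagree, so the variance is non-zero.
K = np.array([[1., 0., 2.], [0., 1., 2.], [0., 0., 1.]])
cam = (K, np.eye(3), np.array([0., 0., 1.]))
maps = [np.zeros((5, 5, 2)), np.full((5, 5, 2), 2.0)]
pts = np.array([[0., 0., 0.]])
mean, var = cross_view_stats(pts, maps, [cam, cam])
```

The variance acts as a simple consistency cue: it is low where the views agree on a point's features and high where they disagree, for example under occlusion.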