Generalizable Human Gaussians from Single-View Image

Jinnan Chen, Chen Li, Jianfeng Zhang, Lingting Zhu, Buzhen Huang, Hanlin Chen, Gim Hee Lee

TL;DR

The paper tackles single-view 3D human reconstruction by learning 3D Gaussians that cover unobserved regions. It introduces a generate-then-refine Human Gaussian Model (HGM) with a dual-branch SMPL-X–informed predictor and a diffusion-based back-view refinement via ControlNet, achieving high-fidelity, view-consistent results. Key contributions include the SMPL-X dual-branch Gaussian prediction, diffusion-guided back-view refinement, iterative SMPL-X refinement, and fast two-view fusion that yields state-of-the-art novel view synthesis and 3D reconstruction without 3D supervision. The approach offers robust generalization to unseen identities and in-the-wild images, with practical implications for AR/VR, film, and games due to its fast rendering and high-detail outputs.

Abstract

In this work, we tackle the task of learning 3D human Gaussians from a single image, focusing on recovering detailed appearance and geometry, including unobserved regions. We introduce a single-view generalizable Human Gaussian Model (HGM), which employs a novel generate-then-refine pipeline guided by a human body prior and a diffusion prior. Our approach uses a ControlNet to refine back-view images rendered from coarse predicted human Gaussians, then uses the refined image along with the input image to reconstruct refined human Gaussians. To mitigate the potential generation of unrealistic human poses and shapes, we incorporate human priors from the SMPL-X model as a dual branch, propagating image features from the SMPL-X volume to the image Gaussians using sparse convolution and attention mechanisms. Given that the initial SMPL-X estimation might be inaccurate, we gradually refine it with our HGM model. We validate our approach on several publicly available datasets. Our method surpasses previous methods in both novel view synthesis and surface reconstruction. Our approach also exhibits strong generalization in cross-dataset evaluation and on in-the-wild images.
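
The generate-then-refine flow described in the abstract can be sketched as plain control flow. This is a minimal illustration, not the authors' implementation: the class `HGMPipeline`, its four callables, and the `smplx_iters` count are all hypothetical placeholders standing in for the HGM network, the SMPL-X refinement step, the Gaussian renderer, and the ControlNet.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class HGMPipeline:
    # All names below are hypothetical stand-ins, not the paper's API.
    predict_gaussians: Callable[[List[Any], Any], Any]  # (images, smplx) -> Gaussians
    refine_smplx: Callable[[Any, Any], Any]             # (smplx, Gaussians) -> smplx
    render_back_view: Callable[[Any], Any]              # Gaussians -> coarse back image
    refine_back_view: Callable[[Any], Any]              # coarse back image -> refined image
    smplx_iters: int = 3                                # assumed iteration count

    def run(self, image: Any, smplx_init: Any) -> Tuple[Any, Any]:
        # Step 1: coarse Gaussians from the single input view,
        # alternating with SMPL-X refinement.
        smplx = smplx_init
        coarse = self.predict_gaussians([image], smplx)
        for _ in range(self.smplx_iters):
            smplx = self.refine_smplx(smplx, coarse)
            coarse = self.predict_gaussians([image], smplx)

        # Step 2: render the unobserved back view and refine it
        # (the ControlNet's role in the paper).
        back_refined = self.refine_back_view(self.render_back_view(coarse))

        # Step 3: two-view reconstruction from the input and refined back view.
        refined = self.predict_gaussians([image, back_refined], smplx)
        return refined, smplx
```

Passing the same predictor to both the single-view and two-view steps is itself an assumption made for brevity; the abstract only states that the refined image and the input image are used together to reconstruct the refined Gaussians.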

Paper Structure

This paper contains 27 sections, 6 equations, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: Our method reconstructs detailed and geometrically consistent human Gaussian models from single-view images, including cases with loose clothing, challenging poses, and in-the-wild images.
  • Figure 2: Our framework and HGM model. (Top) Our framework consists of three steps: 1) coarse Gaussian prediction with iterative SMPL-X refinement; 2) back-view refinement with ControlNet; 3) two-view reconstruction to obtain the refined Gaussians $G_{refine}$. (Bottom) Our HGM model consists of two branches: a $\operatorname{UNet}$ branch that predicts the image Gaussians, and an SMPL-X branch that supplies additional structural features. The features $f_{smpl}$ are sampled from the SMPL-X volume $S^{vol}$ at the Gaussian centers and fused with $f_{u}$ by the fusion transformer $\operatorname{Tr}_{mix}$ to obtain the Gaussian output.
  • Figure 3: Left: Our SMPL-X refinement pipeline. Right: Our back-view refinement ControlNet.
  • Figure 4: Our back-view refinement generates more realistic back-view images than the back-view hallucination of SiTH [sith24].
  • Figure 5: Leveraging our HGM model, SMPL-X parameters are iteratively refined to mitigate the issue of blended legs commonly seen in other approaches.
  • ...and 6 more figures
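
The Figure 2 caption states that $f_{smpl}$ is sampled from the SMPL-X volume $S^{vol}$ at the Gaussian centers. One common way to do such sampling is trilinear interpolation, sketched below with NumPy. The interpolation scheme, the function name `sample_volume`, the `(D, H, W, C)` layout, and the normalized `(z, y, x)` coordinate convention are all assumptions made for illustration; the caption does not specify them.

```python
import numpy as np


def sample_volume(volume: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Trilinearly sample a (D, H, W, C) feature volume at (N, 3) points.

    `centers` holds (z, y, x) coordinates normalized to [0, 1] (assumed
    convention). Returns an (N, C) array of interpolated features.
    """
    dims = np.array(volume.shape[:3])
    # Map normalized coordinates to continuous voxel indices.
    pos = np.clip(centers, 0.0, 1.0) * (dims - 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, dims - 1)  # clamp at the far border
    t = pos - lo                       # fractional offsets in [0, 1)

    out = np.zeros((centers.shape[0], volume.shape[3]), dtype=float)
    # Accumulate the weighted contributions of the 8 surrounding voxels.
    for dz in (0, 1):
        for dy in (0, 1):
            for dx in (0, 1):
                iz = hi[:, 0] if dz else lo[:, 0]
                iy = hi[:, 1] if dy else lo[:, 1]
                ix = hi[:, 2] if dx else lo[:, 2]
                w = ((t[:, 0] if dz else 1 - t[:, 0])
                     * (t[:, 1] if dy else 1 - t[:, 1])
                     * (t[:, 2] if dx else 1 - t[:, 2]))
                out += w[:, None] * volume[iz, iy, ix]
    return out
```

In this sketch the returned per-Gaussian features would play the role of $f_{smpl}$; how they are combined with $f_{u}$ inside $\operatorname{Tr}_{mix}$ is not shown here.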