HumanSplat: Generalizable Single-Image Human Gaussian Splatting with Structure Priors

Panwang Pan, Zhuo Su, Chenguo Lin, Zhen Fan, Yongjie Zhang, Zeming Li, Tingting Shen, Yadong Mu, Yebin Liu

TL;DR

This work presents HumanSplat, which predicts the 3D Gaussian Splatting properties of any human from a single input image in a generalizable manner and surpasses existing state-of-the-art methods in achieving photorealistic novel-view synthesis.

Abstract

Despite recent advancements in high-fidelity human reconstruction techniques, the requirements for densely captured images or time-consuming per-instance optimization significantly hinder their application in broader scenarios. To tackle these issues, we present HumanSplat, which predicts the 3D Gaussian Splatting properties of any human from a single input image in a generalizable manner. In particular, HumanSplat comprises a 2D multi-view diffusion model and a latent reconstruction transformer with human structure priors that adeptly integrate geometric priors and semantic features within a unified framework. A hierarchical loss that incorporates human semantic information is further designed to achieve high-fidelity texture modeling and better constrain the estimated multiple views. Comprehensive experiments on standard benchmarks and in-the-wild images demonstrate that HumanSplat surpasses existing state-of-the-art methods in achieving photorealistic novel-view synthesis.

Paper Structure

This paper contains 20 sections, 6 equations, 11 figures, and 3 tables.

Figures (11)

  • Figure 1: Our method achieves state-of-the-art rendering quality while maintaining the fastest runtime. (a) Qualitative results: LGM \cite{LGM:2024} and GTA \cite{zhang2023globalcorrelated} are generalizable but produce lower-quality results, while TeCH \cite{TECH:2023} suffers from multi-face rendering issues and is time-consuming. In contrast, our method achieves higher fidelity in a much shorter time. (b) Performance and runtime comparison: metrics are evaluated on the challenging Twindom dataset.
  • Figure 2: Overview of HumanSplat. (a) Multi-view latent features are first generated by a fine-tuned multi-view diffusion model (Novel View Synthesizer in Sec. \ref{Texture_arch}). (b) Then, the Latent Reconstruction Transformer (Sec. \ref{transformer}) integrates global latent features (Sec. \ref{global interaction}) with the human geometric prior (Sec. \ref{Human Priors}). (c) Finally, the semantic-guided objectives (Sec. \ref{loss}) are proposed to reconstruct the final human 3DGS.
  • Figure 3: Illustration of the Latent Reconstruction Transformer. It first divides $\mathbf{F}_0$ and $\mathbf{F}_i$ into non-overlapping patches, which are then processed through an intra-attention module (Sec. \ref{global interaction}). Within the inter-attention module (Sec. \ref{Human Priors}), we introduce projection-aware attention with a window $\mathbf{W}$ of size $K_{\text{win}} \times K_{\text{win}}$, and the attributes of the 3D Gaussians are decoded with a $1\times1$ Conv layer.
  • Figure 4: Qualitative comparison of our method against TeCH \cite{TECH:2023}, GTA \cite{zhang2023globalcorrelated}, and LGM \cite{LGM:2024} on THuman2.0 \cite{THuman2.0:2021}, Twindom \cite{Twindom}, and in-the-wild images. Our method achieves the highest quality. Note that TeCH achieves clearer results but fails to preserve the face identity.
  • Figure 5: Qualitative results showcasing reconstructions of humans in challenging poses, diverse identities, and varying camera viewpoints from in-the-wild images.
  • ...and 6 more figures
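
The patch division mentioned in the Figure 3 caption can be illustrated with a small NumPy sketch. This is only an illustration of splitting a feature map into non-overlapping patches, not the authors' implementation: the function name `patchify`, the `(C, H, W)` layout, and the patch size are all assumptions made for the example.

```python
import numpy as np


def patchify(feature_map: np.ndarray, patch_size: int) -> np.ndarray:
    """Split a (C, H, W) feature map into non-overlapping patch tokens.

    Hypothetical helper for illustration only.
    Returns an array of shape (num_patches, C * patch_size * patch_size),
    with patches ordered row by row across the spatial grid.
    """
    c, h, w = feature_map.shape
    if h % patch_size or w % patch_size:
        raise ValueError("H and W must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (C, H, W) -> (C, gh, p, gw, p): expose the patch grid and in-patch offsets.
    patches = feature_map.reshape(c, gh, patch_size, gw, patch_size)
    # -> (gh, gw, C, p, p): put the grid first so each patch is contiguous.
    patches = patches.transpose(1, 3, 0, 2, 4)
    # -> (gh * gw, C * p * p): flatten each patch into one token.
    return patches.reshape(gh * gw, c * patch_size * patch_size)


# Toy latent with 4 channels on an 8x8 grid (sizes chosen arbitrarily).
latent = np.arange(4 * 8 * 8, dtype=np.float32).reshape(4, 8, 8)
tokens = patchify(latent, patch_size=2)
print(tokens.shape)
```

Each row of `tokens` would then be one input token for an attention module; how the paper embeds and attends over these tokens is not specified in the captions above.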