Generalizable Human Gaussians for Sparse View Synthesis

Youngjoong Kwon; Baole Fang; Yixing Lu; Haoye Dong; Cheng Zhang; Francisco Vicente Carrasco; Albert Mosella-Montoro; Jianjin Xu; Shingo Takagi; Daeil Kim; Aayush Prakash; Fernando De la Torre

TL;DR

This work tackles sparse-view generalization for photorealistic human rendering by reframing 3D Gaussian fitting as regression on the 2D UV space of a human template, anchored to SMPL-X geometry. It introduces Generalizable Human Gaussians (GHG) with a multi-scaffold representation, enabling accurate, fast, and test-time-optimization-free synthesis of novel views for unseen subjects. The method combines geometry- and appearance-conditioned 2D CNNs to generate Gaussian parameter maps, and leverages an inpainting module to fill unobserved regions, achieving strong within-dataset and cross-dataset generalization against NeRF-based and Gaussian baselines. Empirical results show improved perceptual quality (LPIPS, FID) and competitive PSNR, with a significant speed advantage over NeRF-based approaches, underscoring the practical potential for real-time or near-real-time digital humans in AR/VR and media production.

Abstract

Recent progress in neural rendering has brought forth pioneering methods, such as NeRF and Gaussian Splatting, which have revolutionized view rendering across domains such as AR/VR, gaming, and content creation. While these methods excel at interpolating within the training data, generalizing to new scenes and objects from very sparse views remains a challenge. In particular, modeling 3D humans from sparse views is difficult because of the inherent complexity of human geometry, which leads to inaccurate reconstructions of geometry and texture. To tackle this challenge, this paper builds on recent advances in Gaussian Splatting and introduces a new method for learning generalizable human Gaussians that allow photorealistic and accurate view rendering of a new human subject from a limited set of sparse views in a feed-forward manner. A pivotal innovation of our approach is reformulating the learning of 3D Gaussian parameters as a regression process defined on the 2D UV space of a human template, which makes it possible to leverage the strong geometry prior and the advantages of 2D convolutions. In addition, a multi-scaffold is proposed to effectively represent the offset details. Our method outperforms recent methods in both within-dataset and cross-dataset generalization settings.
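To make the UV-space formulation more concrete, the sketch below shows one way per-texel parameter maps could be turned into a flat set of 3D Gaussians anchored on a template surface. It is only an illustration under stated assumptions, not the paper's implementation: the function name, the 11-channel layout (3 scale, 4 rotation, 1 opacity, 3 color), and the activations are hypothetical choices borrowed from common Gaussian Splatting practice, and the parameter map would normally come from a 2D CNN rather than being supplied by hand.

```python
import numpy as np


def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gaussians_from_uv_maps(position_map, param_map, valid_mask):
    """Convert UV-space maps into a flat set of 3D Gaussians (illustrative only).

    position_map: (H, W, 3) 3D template-surface point stored at each texel.
    param_map:    (H, W, 11) raw per-texel outputs of a 2D network.
                  Assumed channel layout (hypothetical): scale(3), quaternion(4),
                  opacity(1), color(3).
    valid_mask:   (H, W) bool, True where a texel maps onto the template surface.
    """
    if position_map.shape[:2] != param_map.shape[:2]:
        raise ValueError("position_map and param_map must share H and W")
    if valid_mask.shape != position_map.shape[:2]:
        raise ValueError("valid_mask must have shape (H, W)")
    if param_map.shape[2] != 11:
        raise ValueError("this sketch assumes 11 parameter channels")

    # One Gaussian per valid texel, anchored at the texel's surface point.
    centers = position_map[valid_mask]  # (N, 3)
    raw = param_map[valid_mask]         # (N, 11)

    # Normalize quaternions so they represent valid rotations.
    quat = raw[:, 3:7]
    quat = quat / np.maximum(np.linalg.norm(quat, axis=1, keepdims=True), 1e-8)

    return {
        "center": centers,
        "scale": np.exp(raw[:, 0:3]),      # positive scales
        "rotation": quat,                  # unit quaternions
        "opacity": _sigmoid(raw[:, 7:8]),  # in (0, 1)
        "color": _sigmoid(raw[:, 8:11]),   # in (0, 1)
    }
```

Because every Gaussian corresponds to one texel, the 3D regression problem becomes an image-to-image problem on a regular grid, which is what allows ordinary 2D convolutions to be used.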

Paper Structure (24 sections, 10 equations, 11 figures, 8 tables)

Introduction
Related Work
Generalizable Human Gaussians (GHG)
Background and Motivation
Learning 3D Gaussians in 2D Human UV Space
Modeling Geometric Details with Multi-scaffolds
Training and Optimization
Experiments
Baselines, Datasets, and Metrics
Comparison with NeRF-based methods
Comparison with Gaussian Splatting-based methods
Ablation Studies and Analyses
Conclusion
Appendix - Overview
Limitations and Future Works
...and 9 more sections

Figures (11)

Figure 1: Generalizable Human Gaussian (GHG). Our method can perform accurate and photorealistic novel view renderings of a new human subject given very sparse inputs (e.g., 3 views) without involving any test-time optimization or fine-tuning. In the sparse-view setup, our GHG approach exhibits superior rendering quality compared to other generalizable methods such as NHP [kwon2021neural] and GPS-Gaussian [zheng2023gps].
Figure 2: Overview of GHG. (a) We focus on generalizable human rendering under a very sparse-view setting. (b) We first construct the multi-scaffolds by dilating the human template surface. The 2D UV space of each scaffold serves to collect the geometry and appearance information from the corresponding 3D locations. (c) The aggregated multi-scaffold input is fed into the network, which generates multi-Gaussian parameter maps. (d) Finally, Gaussians are anchored on the corresponding surface of each scaffold and rasterized into novel views.
Figure 3: Illustration of multi-scaffold representation. Each column shows different scaffold levels, with the last column illustrating their combined effect. The top part shows the RGB representation, while the bottom part highlights affected regions, with grey indicating unaffected areas.
Figure 4: Qualitative comparisons. All methods are trained and tested on the THuman dataset [thuman]. †Unlike the other methods, Vanilla-GS [kerbl20233d] is optimized per subject on the testing subjects. *GPS-Gaussian [zheng2023gps] is trained and tested with 5 input views, whereas NHP [kwon2021neural], NIA [kwon2023neural], and our method are trained and tested with 3 input views.
Figure 5: Qualitative results on cross-domain generalization. We train the models on the THuman dataset [thuman] and test on the Renderpeople dataset [renderpeople] without model finetuning. GHG can render high-frequency details and accurate geometry of the novel subject.
...and 6 more figures
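
The multi-scaffold construction mentioned in the Figure 2 caption (dilating the human template surface) can be pictured with the following sketch. It assumes that dilation means pushing each template vertex outward along its vertex normal by a fixed distance per scaffold; that choice, the helper names, and the offset values are all hypothetical, since the captions above do not specify them.

```python
import numpy as np


def vertex_normals(vertices, faces):
    """Area-weighted unit vertex normals for a triangle mesh.

    vertices: (V, 3) float array; faces: (F, 3) int array of vertex indices.
    """
    tri = vertices[faces]  # (F, 3, 3)
    # The cross product's length is twice the triangle area, so summing it
    # per vertex gives an area-weighted average direction.
    face_n = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    normals = np.zeros_like(vertices)
    for k in range(3):
        np.add.at(normals, faces[:, k], face_n)
    length = np.linalg.norm(normals, axis=1, keepdims=True)
    return normals / np.maximum(length, 1e-12)


def build_scaffolds(vertices, faces, offsets):
    """Return one dilated copy of the template per offset (illustrative only).

    offsets: distances along the normal, e.g. [0.0, 0.01, 0.02] in mesh units
             (hypothetical values, not taken from the paper).
    """
    normals = vertex_normals(vertices, faces)
    return [vertices + d * normals for d in offsets]


# Usage: a single triangle in the xy-plane, dilated along +z.
verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
tris = np.array([[0, 1, 2]])
scaffolds = build_scaffolds(verts, tris, [0.0, 0.01, 0.02])
```

All scaffolds share the template's topology, so they can also share its UV parameterization, which is consistent with the caption's description of each scaffold having its own 2D UV space for collecting geometry and appearance information.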
