GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis

Shunyuan Zheng; Boyao Zhou; Ruizhi Shao; Boning Liu; Shengping Zhang; Liqiang Nie; Yebin Liu

TL;DR

This work tackles real-time, high-fidelity novel view synthesis (NVS) of human performers captured by sparse-view cameras. It introduces GPS-Gaussian, a generalizable pixel-wise 3D Gaussian Splatting framework that regresses Gaussian parameter maps on the source views and unprojects them to 3D using a jointly trained depth estimation module, all within a differentiable rendering loop. The approach achieves $2K$-resolution rendering at over 25 FPS without fine-tuning and outperforms state-of-the-art methods (ENeRF, FloRen, 3D-GS) on synthetic and real data, while remaining robust to view sparsity and occlusions. By leveraging large-scale human priors and a two-view depth-guided Gaussian representation, GPS-Gaussian enables instant, interactive, and scalable human NVS suitable for applications such as holographic displays and immersive media.

Abstract

We present a new approach, termed GPS-Gaussian, for synthesizing novel views of a character in real time. The proposed method enables 2K-resolution rendering under a sparse-view camera setting. Unlike the original Gaussian Splatting or neural implicit rendering methods, which require per-subject optimization, we introduce Gaussian parameter maps defined on the source views and directly regress Gaussian Splatting properties for instant novel view synthesis without any fine-tuning or optimization. To this end, we train our Gaussian parameter regression module on a large amount of human scan data, jointly with a depth estimation module that lifts the 2D parameter maps to 3D space. The proposed framework is fully differentiable, and experiments on several datasets demonstrate that our method outperforms state-of-the-art methods while also rendering faster.
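The lifting step mentioned in the abstract can be illustrated with a small sketch. It assumes a standard pinhole camera model and a camera-to-world rigid transform; the function name `unproject_depth` and its parameters (`fx`, `fy`, `cx`, `cy`, `cam_to_world_R`, `cam_to_world_t`) are hypothetical and are not taken from the paper or its code. It shows only how a per-pixel depth map yields a per-pixel 3D position map, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def unproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    cam_to_world_R: Sequence[Sequence[float]],
    cam_to_world_t: Sequence[float],
) -> List[List[Vec3]]:
    """Lift each pixel (u, v) with depth d to a 3D point in world space.

    Assumes a pinhole camera (hypothetical setup, not the paper's code).
    """
    points: List[List[Vec3]] = []
    for v, row in enumerate(depth):
        out_row: List[Vec3] = []
        for u, d in enumerate(row):
            # Back-project the pixel through the pinhole model into camera space.
            x_cam = ((u - cx) * d / fx, (v - cy) * d / fy, d)
            # Apply the rigid camera-to-world transform: X_world = R @ X_cam + t.
            x_world = tuple(
                sum(cam_to_world_R[i][k] * x_cam[k] for k in range(3))
                + cam_to_world_t[i]
                for i in range(3)
            )
            out_row.append(x_world)
        points.append(out_row)
    return points


# Toy example: a 2x2 depth map at constant depth 2, identity rotation,
# and a camera shifted by one unit along the world z-axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pts = unproject_depth(
    depth=[[2.0, 2.0], [2.0, 2.0]],
    fx=1.0, fy=1.0, cx=0.0, cy=0.0,
    cam_to_world_R=identity,
    cam_to_world_t=[0.0, 0.0, 1.0],
)
```

In this sketch every source pixel receives one 3D position, which is what lets the remaining per-pixel Gaussian parameters be stored as ordinary 2D maps on the source views.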

Paper Structure (22 sections, 15 equations, 8 figures, 4 tables)

Introduction
Related Work
Neural Implicit Human Representation.
Deep Image-based Rendering.
Point-based Graphics.
Preliminary
Method
View Selection and Depth Estimation
Pixel-wise Gaussian Parameters Prediction
Joint Training with Differentiable Rendering
Experiments
Implementation Details
Datasets and Metrics
Comparisons with State-of-the-art Methods
Ablation Studies
...and 7 more sections

Figures (8)

Figure 1: High-fidelity and real-time novel view synthesis (NVS). Our proposed method synthesizes $2K$-resolution novel views of unseen human performers in real time without any fine-tuning or optimization. It outperforms the state-of-the-art NVS methods ENeRF [lin2022enerf], FloRen [shao2022floren] and 3D-GS [kerbl2023_3dgs], which are representative approaches in implicit neural human rendering, image-based human rendering and per-subject optimization, respectively. We mark the running efficiency only for the feed-forward methods.
Figure 2: Overview of GPS-Gaussian. Given RGB images of a human-centered scene captured from sparse camera views and a target novel viewpoint, we select the two adjacent views on which to formulate our Gaussian representation. We extract image features and then perform iterative depth estimation. For each source view, the depth map and the RGB image serve as a 3D position map and a color map, respectively, to formulate the Gaussian representation, while the other parameters of the 3D Gaussians are predicted in a pixel-wise manner. The Gaussian parameter maps defined on the 2D image planes of both views are then unprojected to 3D space and aggregated for novel view rendering. The fully differentiable framework enables joint training of all networks.
Figure 3: Qualitative comparison on THuman2.0 [yu2021function4d], Twindom [twindom] and our collected real-world data. Our method produces more detailed human appearances and can recover more reasonable geometry.
Figure 4: Qualitative ablation study on synthetic data. We show the effectiveness of the joint training and the depth encoder in the full pipeline. The proposed designs make the rendering results more visually appealing, with fewer artifacts and less blur.
Figure 5: Visualization of opacity maps. (a) One of the source view images. (b) The predicted opacity map corresponding to (a). (c)/(d) The directly projected color/opacity map at the novel viewpoint. (e) Novel view rendering result. A cold color in (b) and (d) represents an opacity value near 0, while a hot color represents a value near 1. The low opacity values predicted for the outliers make them invisible.
...and 3 more figures
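
The observation in the Figure 5 caption, that outliers with low predicted opacity become invisible, can be illustrated with a minimal sketch of front-to-back alpha blending along a single ray. This is the generic blending rule used by point-based splatting renderers; whether it matches the authors' rasterizer in every detail is an assumption, and the function name `composite` and the toy values are hypothetical.

```python
from typing import Sequence, Tuple

Color = Tuple[float, float, float]


def composite(colors: Sequence[Color], alphas: Sequence[float]) -> Color:
    """Front-to-back alpha blending of depth-sorted samples along one ray.

    Generic blending rule (assumed, not copied from the paper's code):
    C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j)
    """
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed by nearer samples
    for color, alpha in zip(colors, alphas):
        weight = alpha * transmittance
        for k in range(3):
            out[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return (out[0], out[1], out[2])


# Toy ray: a red outlier floating in front of an opaque blue surface.
red, blue = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)

# With a low opacity on the outlier, the pixel is almost entirely blue.
pixel_low = composite([red, blue], [0.02, 1.0])

# With a high opacity on the outlier, it dominates the pixel instead.
pixel_high = composite([red, blue], [0.9, 1.0])
```

In the first case the outlier contributes only its opacity-weighted share (0.02) to the final color, which is why a near-zero opacity prediction effectively hides a misplaced point without removing it explicitly.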
