RoGSplat: Learning Robust Generalizable Human Gaussian Splatting from Sparse Multi-View Images

Junjin Xiao, Qing Zhang, Yongwei Nie, Lei Zhu, Wei-Shi Zheng

TL;DR

RoGSplat synthesizes novel views of unseen humans from sparse multi-view images in a generalizable way, without per-subject optimization. It first lifts SMPL vertices to dense, image-aligned 3D prior points using an SPD-based network that fuses pixel-level and voxel-level features, then regresses coarse 3D Gaussians from those points. A coarse-to-fine stage follows: depth maps rendered from the coarse Gaussians guide the regression of fine-grained pixel-wise Gaussians. Training uses a two-stage scheme with geometry- and texture-focused losses plus a depth refiner, and the resulting model offers near-real-time inference and strong cross-dataset generalization. Empirically, RoGSplat outperforms state-of-the-art NeRF-based and 3D Gaussian Splatting methods on multiple benchmarks and remains robust to SMPL misalignment. Loose clothing and facial detail are identified as areas for improvement.

Abstract

This paper presents RoGSplat, a novel approach for synthesizing high-fidelity novel views of unseen humans from sparse multi-view images, while requiring no cumbersome per-subject optimization. Unlike previous methods, which typically struggle with sparse views that have little overlap and are less effective in reconstructing complex human geometry, the proposed method enables robust reconstruction in such challenging conditions. Our key idea is to lift SMPL vertices to dense and reliable 3D prior points representing accurate human body geometry, and then regress human Gaussian parameters based on these points. To account for possible misalignment between the SMPL model and the images, we propose to predict image-aligned 3D prior points by leveraging both pixel-level and voxel-level features, from which we regress the coarse Gaussians. To enhance the ability to capture high-frequency details, we further render depth maps from the coarse 3D Gaussians to help regress fine-grained pixel-wise Gaussians. Experiments on several benchmark datasets demonstrate that our method outperforms state-of-the-art methods in novel view synthesis and cross-dataset generalization. Our code is available at https://github.com/iSEE-Laboratory/RoGSplat.

Paper Structure

This paper contains 17 sections, 13 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: High-fidelity human novel view synthesis. Given very sparse-view input images (e.g., 4 views) whose limited overlap prevents accurate human template estimation, our method robustly synthesizes high-fidelity novel views in a generalizable manner, without requiring any further fine-tuning or subject-specific optimization. Compared to both NeRF-based methods, e.g., TransHuman [Pan_2023_ICCV], and 3D Gaussian Splatting (3DGS) based methods, e.g., vanilla 3DGS [3DGS] and GPS-Gaussian [zheng2024gpsgaussian], our approach produces better results.
  • Figure 2: Overview of RoGSplat. We first fit the SMPL model from the input sparse views, and then feed the SMPL depth into a depth refiner to get refined depth, from which we obtain voxel-level features. These features are then aggregated with pixel-level features extracted from the source images and passed to the SPD network [xiang2023SPD, xiang2021snowflakenet] to generate dense image-aligned prior points for coarse Gaussian rasterization. To help model finer details, the image-aligned depth maps rendered from the coarse Gaussians are unprojected to yield finer pixel-wise points. These points are then refined by an offset estimator and finally used to regress fine-grained Gaussians.
  • Figure 3: Qualitative comparison of in-domain generalization on the THuman2.0 dataset [tao2021function4d].
  • Figure 4: Qualitative comparison of cross-domain generalization. The results of each method are produced by models trained on the THuman2.0 dataset [tao2021function4d].
  • Figure 5: Robustness to inaccurate SMPL. Unlike previous methods, our method produces visually similar high-fidelity renderings from either an inaccurately fitted SMPL or the ground-truth (GT) SMPL.
  • ...and 10 more figures
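
The Figure 2 caption mentions that depth maps rendered from the coarse Gaussians are unprojected into pixel-wise points. As a rough illustration of what unprojection means in general, the sketch below lifts a depth map to camera-space 3D points under a standard pinhole model. It is not taken from the RoGSplat code: the function name `unproject_depth`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the optional foreground `mask` are all assumptions made for this example, and plain Python lists are used only to keep it self-contained (a real pipeline would vectorize this on the GPU).

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


def unproject_depth(
    depth: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    mask: Optional[List[List[bool]]] = None,
) -> List[Point]:
    """Lift every valid pixel (u, v) with depth d to a camera-space point.

    Assumptions (hypothetical, for illustration only): an undistorted
    pinhole camera, depth measured along the optical z-axis, and pixel
    coordinates taken at integer pixel indices.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0.0:
                continue  # no surface was rendered at this pixel
            if mask is not None and not mask[v][u]:
                continue  # pixel lies outside the (assumed) human mask
            # Invert the pinhole projection u = fx * x / z + cx, v = fy * y / z + cy.
            x = (u - cx) * d / fx
            y = (v - cy) * d / fy
            points.append((x, y, d))
    return points


# Toy 2x2 depth map; the top-left pixel is empty (depth 0).
depth = [[0.0, 2.0],
         [2.0, 4.0]]
pts = unproject_depth(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(pts)
```

The points come out in the camera frame of the view that produced the depth map; merging points from several source views would additionally require each camera's extrinsics, which this sketch leaves out.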