Splat-SAP: Feed-Forward Gaussian Splatting for Human-Centered Scene with Scale-Aware Point Map Reconstruction

Boyao Zhou, Shunyuan Zheng, Zhanfeng Liao, Zihan Ma, Hanzhang Tu, Boning Liu, Yebin Liu

TL;DR

Splat-SAP tackles the challenge of free-view rendering for human-centered scenes from sparse binocular views by introducing a two-stage, feed-forward pipeline. It first learns scale-aware geometry maps in canonical space and then refines them in real space using an affinity-driven, pixel-wise translation, followed by depth refinement and a Gaussian-plane rendering strategy. The method is trained with a self-supervised Stage 1 and a photometric Stage 2, enabling high-quality renderings without 3D supervision and showing strong performance across diverse camera setups. This approach delivers robust, temporally coherent free-view video synthesis under sparse input conditions and offers practical improvements over existing feed-forward and optimization-based methods.

Abstract

We present Splat-SAP, a feed-forward approach to rendering novel views of human-centered scenes from binocular cameras with large sparsity. Gaussian Splatting has shown promising potential in rendering tasks, but it typically requires per-scene optimization with dense input views. Although some recent approaches achieve feed-forward Gaussian Splatting rendering through geometry priors obtained by multi-view stereo, they still require heavily overlapping input views to establish the geometry prior. To bridge this gap, we represent geometry with pixel-wise point map reconstruction, which is robust to large sparsity because it models each view independently. We propose a two-stage learning strategy. In stage 1, we transform the point map into real space via an iterative affinity learning process, which facilitates subsequent camera control. In stage 2, we project the point maps of the two input views onto the target view plane and refine this geometry via stereo matching. Furthermore, we anchor Gaussian primitives on the refined plane to render high-quality images. As a metric representation, the scale-aware point map in stage 1 is trained in a self-supervised manner without 3D supervision, and stage 2 is supervised with a photometric loss. We collect multi-view human-centered data and demonstrate that our method improves both the stability of point map reconstruction and the visual quality of free-viewpoint rendering.
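The two geometric operations named in the abstract, moving a point map into real (metric) space and projecting it onto a target view plane, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single global `scale`/`rotation`/`translation`, the optional `pixel_offsets`, and the pinhole camera model are all assumptions made for illustration. The learned, iterative affinity process and the stereo-matching refinement are not shown.

```python
import numpy as np

def apply_affine(points, scale, rotation, translation, pixel_offsets=None):
    """Map a canonical point map (H, W, 3) into real space.

    Hypothetical parameterization: one global similarity transform plus an
    optional per-pixel translation of shape (H, W, 3).
    """
    out = scale * (points @ rotation.T) + translation
    if pixel_offsets is not None:
        out = out + pixel_offsets
    return out

def project_to_target(points_world, K, R, t, height, width):
    """Project 3D points into a target pinhole camera with a simple z-buffer.

    Returns a depth map (height, width); pixels that receive no point are inf.
    """
    pts = points_world.reshape(-1, 3)
    cam = pts @ R.T + t                      # world -> target camera frame
    z = cam[:, 2]
    valid = z > 1e-6                         # keep points in front of the camera
    uvw = cam[valid] @ K.T                   # apply intrinsics
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    zv = z[valid]
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.full((height, width), np.inf)
    # Keep the nearest depth when several points land on the same pixel.
    np.minimum.at(depth, (v[inside], u[inside]), zv[inside])
    return depth

# Toy example: a fronto-parallel plane at canonical depth 1, rescaled by 2.
H = W = 4
K = np.array([[4.0, 0.0, 1.5], [0.0, 4.0, 1.5], [0.0, 0.0, 1.0]])
vs, us = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
canonical = np.stack([(us - 1.5) / 4.0, (vs - 1.5) / 4.0, np.ones((H, W))], axis=-1)

real = apply_affine(canonical, scale=2.0, rotation=np.eye(3), translation=np.zeros(3))
depth = project_to_target(real, K, np.eye(3), np.zeros(3), H, W)
```

In this toy setup the target camera coincides with the source camera, so every pixel receives exactly one point and the projected depth is the rescaled plane depth. With a genuinely different target pose, the projected map would contain holes and occlusions, which is presumably where a refinement step such as the stereo matching described in stage 2 becomes necessary.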

Paper Structure

This paper contains 34 sections, 17 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Human-centered scene reconstruction and free-view video synthesis. (a) Source view inputs, (b) our metric-scale point map reconstruction, and (c) free-view rendering with our feed-forward Gaussian Splatting.
  • Figure 2: Overview of Splat-SAP. Our method consists of two stages. In the first stage, we take two coarse-resolution images as input and predict the corresponding point maps, along with an affine transform. In the second stage, our refinement module takes the transformed points and fine-resolution images as input and predicts the Gaussian plane of the target view for high-quality rendering.
  • Figure 3: Qualitative comparison of rendering. We show results of (a) ENeRF (Lin et al., 2022), (b) MVSGaussian (Liu et al., 2024), (c) MVSplat (Chen et al., 2024), (d) Ours, and (e) Ground Truth for the GoPro, Camera, and Mobile datasets.
  • Figure 4: Qualitative comparison of rendering on a sequence of data. Our method preserves temporal and view consistency against 4D-GS and NoPoSplat.
  • Figure 5: Qualitative comparison of geometry. We show point maps of (a) DUSt3R, (b) VGGT, (c) ours without pixel-wise translation, and (d) ours with full affinity. Each point map reconstruction is shown with its corresponding pixels.
  • ...and 4 more figures