SplatFormer: Point Transformer for Robust 3D Gaussian Splatting

Yutong Chen, Marko Mihajlovic, Xiyi Chen, Yiming Wang, Sergey Prokudin, Siyu Tang

TL;DR

This work targets robust novel view synthesis under out-of-distribution camera angles by introducing SplatFormer, a point-transformer that refines an initial 3D Gaussian Splatting (3DGS) representation in one forward pass. The model learns 3D priors from large-scale ShapeNet and Objaverse datasets and is trained with a 2D rendering loss that combines $\mathcal{L}_1$ and perceptual components, $\mathcal{L}_{\text{LPIPS}}$, over both in-distribution and OOD views. A new evaluation protocol, OOD-NVS, reveals that prior methods struggle with extreme viewpoint deviations, while SplatFormer achieves state-of-the-art fidelity and 3D consistency in both synthetic and real-world cross-dataset scenarios. The results emphasize the efficacy of applying a 3D point-transformer to Gaussian splats and highlight the practical impact for immersive AR/VR rendering where unseen viewpoints are common. Overall, the work demonstrates that data-driven priors and 3D-consistent refinement via transformers can substantially improve OOD renderings while maintaining real-time capabilities.
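As a rough sketch of the training objective described above, the combined rendering loss could be written as follows. The weight $\lambda$, the view sets $\mathcal{V}_{\text{in}}$ and $\mathcal{V}_{\text{OOD}}$, and the symbols $\hat{I}_v$ (image rendered from the refined splats at view $v$) and $I_v$ (ground-truth image) are notation assumed here for illustration; this summary does not state the exact weighting or normalization.

$$
\mathcal{L} \;=\; \sum_{v \,\in\, \mathcal{V}_{\text{in}} \cup \mathcal{V}_{\text{OOD}}} \Big[ \mathcal{L}_1\big(\hat{I}_v, I_v\big) \;+\; \lambda\, \mathcal{L}_{\text{LPIPS}}\big(\hat{I}_v, I_v\big) \Big]
$$

Supervising on OOD views as well as input-like views is what would push the refinement to correct artifacts that are invisible from the training cameras.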

Abstract

3D Gaussian Splatting (3DGS) has recently transformed photorealistic reconstruction, achieving high visual fidelity and real-time performance. However, rendering quality significantly deteriorates when test views deviate from the camera angles used during training, posing a major challenge for applications in immersive free-viewpoint rendering and navigation. In this work, we conduct a comprehensive evaluation of 3DGS and related novel view synthesis methods under out-of-distribution (OOD) test camera scenarios. By creating diverse test cases with synthetic and real-world datasets, we demonstrate that most existing methods, including those incorporating various regularization techniques and data-driven priors, struggle to generalize effectively to OOD views. To address this limitation, we introduce SplatFormer, the first point transformer model specifically designed to operate on Gaussian splats. SplatFormer takes as input an initial 3DGS set optimized under limited training views and refines it in a single forward pass, effectively removing potential artifacts in OOD test views. To our knowledge, this is the first successful application of point transformers directly on 3DGS sets, surpassing the limitations of previous multi-scene training methods, which could handle only a restricted number of input views during inference. Our model significantly improves rendering quality under extreme novel views, achieving state-of-the-art performance in these challenging scenarios and outperforming various 3DGS regularization techniques, multi-scene models tailored for sparse view synthesis, and diffusion-based frameworks.

Paper Structure

This paper contains 15 sections, 7 equations, 15 figures, 10 tables.

Figures (15)

  • Figure 1: We investigate out-of-distribution (OOD) novel view synthesis (NVS), where test views differ significantly from input views. This scenario contrasts with prior in-distribution NVS, where test views interpolate between densely captured input views; sparse NVS, with a few large-baseline input views; and Nerfbusters NVS [Nerfbusters2023], where test views share similar angles with input views. Existing NVS methods, including MipNeRF360 [barron2022mipnerf360] and those designed for sparse inputs such as LaRa [LaRa], face challenges in this setting, while our method shows notable improvements.
  • Figure 2: Limitations of 3DGS in the OOD-NVS setup. The quality of novel views rendered by 3DGS deteriorates significantly as the test camera deviates from the distribution of input camera views; our solution, SplatFormer, overcomes this and produces higher-fidelity renderings. The displayed metric (left) is computed on scenes from Objaverse [deitke2022objaverseuniverseannotated3d]; see the Experiments section for the detailed experiment setup.
  • Figure 3: Method Overview. We introduce SplatFormer, a generalizable 3D point transformer network designed for feed-forward refinement of Gaussian splats, enabling robust out-of-distribution novel-view synthesis (OOD-NVS). The reconstruction process begins by generating an initial set of 3D Gaussians from input images. However, these splats are biased toward the input views and are not robust for OOD-NVS. SplatFormer refines these splats through a hierarchical neural network that models residuals to the initial splat attributes. The model is trained on a large collection of 3D shapes using a 2D rendering loss, allowing it to: 1) incorporate spatial regularity among splat primitives via the hierarchical architecture, 2) leverage generic priors from large-scale datasets, and 3) ensure 3D consistency by refining 3D primitives directly.
  • Figure 4: Novel View Synthesis under Out-of-Distribution Camera Angles. The first column shows 4 out of 32 input views. Here, we compare our method with LaRa [LaRa], SplatFields [SplatFields], MipNeRF360 [barron2022mipnerf360], 2DGS [huang20242dSplat_Tuebingen], and 3DGS [kerbl3dgs]. Results are shown on Objaverse-OOD evaluation scenes; a comprehensive comparison with all the baselines is provided in the appendix.
  • Figure 5: Cross-dataset Generalization. SplatFormer trained on Objaverse successfully mitigates artifacts in OOD views on the GSO [gso] dataset and on our real-world object-centric captures. Additional results are presented in the appendix.
  • ...and 10 more figures
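
The Figure 3 caption describes SplatFormer as modeling residuals to the initial splat attributes. A minimal sketch of that residual update is given below, assuming a heavily simplified attribute set (position, scale, opacity, RGB color) and a placeholder callable in place of the point transformer. The names `Splat`, `refine`, `zero_residual`, and `shift_residual` are hypothetical and do not come from the paper or its code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Splat:
    """One Gaussian primitive with a deliberately simplified attribute set."""
    position: List[float]  # 3D center
    scale: List[float]     # per-axis extent
    opacity: float
    color: List[float]     # RGB, standing in for full appearance attributes


# Hypothetical stand-in for the learned network: it sees the whole splat set
# and returns one residual (stored in a Splat for convenience) per input splat.
ResidualFn = Callable[[List[Splat]], List[Splat]]


def _add(a: List[float], b: List[float]) -> List[float]:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def refine(splats: List[Splat], residual_fn: ResidualFn) -> List[Splat]:
    """Return refined splats computed as initial attributes plus residuals."""
    residuals = residual_fn(splats)
    if len(residuals) != len(splats):
        raise ValueError("residual_fn must return one residual per splat")
    return [
        Splat(
            position=_add(s.position, r.position),
            scale=_add(s.scale, r.scale),
            opacity=s.opacity + r.opacity,
            color=_add(s.color, r.color),
        )
        for s, r in zip(splats, residuals)
    ]


def zero_residual(splats: List[Splat]) -> List[Splat]:
    """Placeholder network that predicts no change, so refine() is the identity."""
    return [Splat([0.0] * 3, [0.0] * 3, 0.0, [0.0] * 3) for _ in splats]


def shift_residual(splats: List[Splat]) -> List[Splat]:
    """Placeholder network that moves every splat by +1 and raises opacity by 0.25."""
    return [Splat([1.0] * 3, [0.0] * 3, 0.25, [0.0] * 3) for _ in splats]
```

With a zero residual the refined set equals the initial 3DGS set, so under this formulation a network only has to learn corrections rather than regenerate the scene from scratch.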