G3Splat: Geometrically Consistent Generalizable Gaussian Splatting

Mehdi Hosseinzadeh, Shin-Fang Chng, Yi Xu, Simon Lucey, Ian Reid, Ravi Garg

TL;DR

G3Splat tackles geometric inconsistencies in generalizable Gaussian splatting under self-supervision by introducing explicit geometric priors. It enforces two constraints, alignment of each Gaussian's orientation with the local surface normal and alignment of Gaussians with their pixel rays, and integrates them with both DUSt3R and VGGT backbones to produce pixel-aligned, geometrically coherent splats. Evaluations on RealEstate10K and zero-shot tests on ScanNet and ACID show state-of-the-art geometry, pose estimation, and novel-view synthesis, with depth and mesh reconstructions obtained by rendering the expected depth D_exp. The work includes ablations and releases code and pretrained models to support replication and further research on geometrically consistent 3D scene recovery from unposed views.

Abstract

3D Gaussians have recently emerged as an effective scene representation for real-time splatting and accurate novel-view synthesis, motivating several works to adapt multi-view structure prediction networks to regress per-pixel 3D Gaussians from images. However, most prior work extends these networks to predict additional Gaussian parameters -- orientation, scale, opacity, and appearance -- while relying almost exclusively on view-synthesis supervision. We show that a view-synthesis loss alone is insufficient to recover geometrically meaningful splats in this setting. We analyze and address the ambiguities of learning 3D Gaussian splats under self-supervision for pose-free generalizable splatting, and introduce G3Splat, which enforces geometric priors to obtain geometrically consistent 3D scene representations. Trained on RE10K, our approach achieves state-of-the-art performance in (i) geometrically consistent reconstruction, (ii) relative pose estimation, and (iii) novel-view synthesis. We further demonstrate strong zero-shot generalization on ScanNet, substantially outperforming prior work in both geometry recovery and relative pose estimation. Code and pretrained models are released on our project page (https://m80hz.github.io/g3splat/).

Paper Structure

This paper contains 28 sections, 29 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: G3Splat enables geometrically consistent, pose-free generalizable Gaussian splatting across backbones. Left: our VGGT-based adaptation without / with the proposed priors. Right: our DUSt3R-based adaptation without / with the proposed priors. We visualize reconstructions on a Sora-generated video (150 input views) and RealEstate10K (2 input views). Our priors encourage geometrically consistent Gaussians and markedly reduce floating artifacts. Sora prompt: "Generate a video inside the Louvre Museum, including the paintings."
  • Figure 2: Qualitative comparison of predicted Gaussian parameters. For visualization, we denote by $(s_1,s_2,s_3)$ the sorted eigen-scales of each Gaussian covariance (\ref{eq:covariance}), such that $s_1 \ge s_2 \ge s_3$; the smallest scale $s_3$ characterizes uncertainty along the surface normal direction. Row 1 (ours) shows: (a) the source image to which Gaussians are aligned, (b) skewness of the estimated Gaussians within their defining plane, and (c) predicted Gaussian orientations visualized as surface-normal maps. Rows 2 and 3 show results for NoPoSplat and MVSplat, respectively: (d/g) elongation of the Gaussians perpendicular to their dominant plane, (e/h) skewness of the Gaussians within the dominant plane, and (f/i) normals to the dominant plane. Existing methods yield Gaussian orientations without clear geometric meaning: MVSplat Gaussians (i) align mostly fronto-parallel to the source image plane, and NoPoSplat Gaussian orientations (f) depend strongly on texture, spanning a few dominant directions that are inconsistent with scene geometry. Our method produces plausible, near-Manhattan structured surface orientations. Baseline Gaussians exhibit significant elongation perpendicular to their dominant surfaces (visible as non-red colors in d/g). Notably, our Gaussians remain relatively circular (blue in b) on planar, textureless surfaces and become skewed ellipses (red in b) near sharp geometric edges such as shelves or wall corners.
  • Figure 3: Qualitative comparison of rendered novel-view depth on RE10K (first row), ACID (second row), and ScanNet (last row).
  • Figure 4: Qualitative ablation of reconstructed meshes on ScanNet (2 input views) using VGGT and DUSt3R backbones. Our proposed priors consistently yield sharper, more complete, and less noisy geometry across both backbones.
  • Figure 5: Qualitative ablation of reconstructed Gaussians on a Sora-generated video (VGGT backbone, 24 input views). Prompt used to generate the video: "A single unbroken orbital camera move through a vast, empty gothic library, with static architecture, medium-wide framing, warm steady lighting, and crisp sharp geometric details".
  • ...and 7 more figures
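
The sorted eigen-scales $(s_1,s_2,s_3)$ in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the common 3D Gaussian splatting parameterization $\Sigma = R S S^\top R^\top$, so that the scales are the square roots of the covariance eigenvalues; the paper's own covariance equation is not reproduced in this summary, so treat that as an assumption. The function name `sorted_eigen_scales` and the example values are hypothetical and are not taken from the released code.

```python
import numpy as np


def sorted_eigen_scales(cov):
    """Return scales (s1 >= s2 >= s3) and a unit normal for a 3x3 covariance.

    Assumes cov = R @ S @ S.T @ R.T (an assumption, see lead-in), so each
    scale is the square root of an eigenvalue of cov.
    """
    cov = np.asarray(cov, dtype=float)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    scales = np.sqrt(eigvals)[::-1].copy()  # descending: s1, s2, s3
    # The eigenvector of the smallest eigenvalue is the direction of s3,
    # which the caption interprets as the surface normal (sign is ambiguous).
    normal = eigvecs[:, 0]
    return scales, normal


# Hypothetical example: a thin, disc-like Gaussian rotated 30 degrees about x.
theta = np.pi / 6
R = np.array([
    [1.0, 0.0, 0.0],
    [0.0, np.cos(theta), -np.sin(theta)],
    [0.0, np.sin(theta), np.cos(theta)],
])
S = np.diag([0.5, 0.2, 0.01])
cov = R @ S @ S.T @ R.T

scales, normal = sorted_eigen_scales(cov)
```

Here `scales` recovers the diagonal of `S` in descending order, and `normal` matches the third column of `R` up to sign. A small ratio $s_3 / s_2$ indicates a flat, surface-like Gaussian, which is the property the caption reads off the baselines' elongation maps.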