GGPT: Geometry Grounded Point Transformer

Yutong Chen, Yiming Wang, Xucong Zhang, Sergey Prokudin, Siyu Tang

Abstract

Recent feed-forward networks have achieved remarkable progress in sparse-view 3D reconstruction by predicting dense point maps directly from RGB images. However, they often suffer from geometric inconsistencies and limited fine-grained accuracy due to the absence of explicit multi-view constraints. We introduce the Geometry-Grounded Point Transformer (GGPT), a framework that augments feed-forward reconstruction with reliable sparse geometric guidance. We first propose an improved Structure-from-Motion pipeline based on dense feature matching and lightweight geometric optimisation to efficiently estimate accurate camera poses and partial 3D point clouds from sparse input views. Building on this foundation, we propose a geometry-guided 3D point transformer that refines dense point maps under explicit partial-geometry supervision using an optimised guidance encoding. Extensive experiments demonstrate that our method provides a principled mechanism for integrating geometric priors with dense feed-forward predictions, producing reconstructions that are both geometrically consistent and spatially complete, recovering fine structures and filling gaps in textureless areas. Trained solely on ScanNet++ with VGGT predictions, GGPT generalises across architectures and datasets, substantially outperforming state-of-the-art feed-forward 3D reconstruction models in both in-domain and out-of-domain settings.

Paper Structure (29 sections, 23 equations, 12 figures, 10 tables)

This paper contains 29 sections, 23 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Top row: the dense point maps predicted by feed-forward methods (e.g. VGGT wang2025vggt) struggle with multi-view geometric consistency, resulting in large errors. Bottom row: with geometric guidance, our GGPT refines the dense point maps to enhance global alignment and 3D consistency, substantially reducing the reconstruction error.
  • Figure 2: Overview of our method. We utilise off-the-shelf feed-forward multi-view transformers wang2025vggt, keetha2025mapanything and dense matchers edstedt2024roma, zhang2025ufm to predict dense point maps $\{\mathbf{P}_i\!\in\!\mathbb{R}^{H\times W\times 3}\}_{i=1}^N$, camera parameters $\{\mathbf{g}_i\}_{i=1}^N$, and multi-view correspondences $\mathbf{T}\!\in\!\mathbb{R}^{N\times N\times H\times W\times 2}$. These correspondences are used for sparse bundle adjustment and multi-view DLT triangulation, producing a geometrically consistent yet incomplete sparse cloud $\mathbf{X}_s$ via a lightweight alternative to standard SfM pipelines (Section \ref{subsec:fastba}). The dense yet multi-view inconsistent point cloud $\mathbf{X}_d$, obtained by combining the predicted per-view point maps, exhibits local misalignments (red boxes) that are then refined under the geometric guidance of $\mathbf{X}_s$ by our Geometry-Grounded Point Transformer (GGPT) (Section \ref{subsec:ptv3}), a modified point transformer wu2024ptv3 equipped with specialised geometric embeddings $\mathrm{PE}(\mathbf{x}_{d\rightarrow s})$ and $\Delta_{d\rightarrow s}$, which encode spatial relations between dense and sparse points to provide stable geometric guidance. The final output is a refined, globally aligned, and geometry-consistent dense point cloud $\hat{\mathbf{X}}_d$.
  • Figure 3: 3D reconstruction results on T&T Knapitsch2017tandt and ETH3D schoeps2017eth3d. Points with confidence above the 90% quantile are visualised. We compare the error maps of the reconstruction before and after our refinement. In the zoomed-in regions, the ground truth (colored by input RGBs) is overlaid with the predictions (colored by error) to highlight how our method corrects the misalignment of input points.
  • Figure 4: Comparison of SfM pipelines on ETH3D schoeps2017eth3d. Across 4-, 8-, and 16-view setups, our SfM pipeline achieves consistently better camera pose accuracy and point accuracy, with good point completeness, while retaining the shortest running time.
  • Figure 5: Examples on out-of-domain 4D-DRESS wang20244ddress with visualisation (top row), and MV-dVRK dvrk-smv with both visualisation and reconstruction error (bottom row), predicted by feed-forward networks and enhanced by our method. Our GGPT can significantly improve the multi-view consistency of the reconstruction.
  • ...and 7 more figures
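
The multi-view DLT triangulation named in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it shows only the textbook linear (homogeneous) DLT for a single point, and the function names `triangulate_dlt`, `make_camera`, and `project`, the pinhole intrinsics, and the synthetic three-camera setup are all assumptions introduced for illustration. It also omits steps that a practical pipeline would likely need, such as bundle adjustment and outlier filtering.

```python
import numpy as np


def triangulate_dlt(projections, pixels):
    """Triangulate one 3D point from N >= 2 views with the linear DLT method.

    projections: sequence of N camera matrices, each of shape (3, 4).
    pixels: sequence of N (u, v) observations of the same 3D point.
    Returns the triangulated point as an array of shape (3,).
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        P = np.asarray(P, dtype=float)
        # Each view contributes two linear constraints on the homogeneous point X:
        #   u * (P[2] @ X) - P[0] @ X = 0
        #   v * (P[2] @ X) - P[1] @ X = 0
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)  # shape (2N, 4)

    # The least-squares solution of A X = 0 with ||X|| = 1 is the right
    # singular vector associated with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenise


def make_camera(f, cx, cy, t):
    """Hypothetical pinhole camera K [I | t] with identity rotation."""
    K = np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])
    Rt = np.hstack([np.eye(3), np.asarray(t, dtype=float).reshape(3, 1)])
    return K @ Rt


def project(P, X):
    """Project a 3D point X with camera matrix P and return pixel (u, v)."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


# Synthetic check: three translated cameras observing one known point.
X_true = np.array([0.2, -0.1, 4.0])
cams = [
    make_camera(500.0, 320.0, 240.0, t)
    for t in ([0.0, 0.0, 0.0], [-0.5, 0.0, 0.0], [0.0, -0.5, 0.0])
]
obs = [project(P, X_true) for P in cams]
X_hat = triangulate_dlt(cams, obs)
```

With noise-free observations, `X_hat` recovers `X_true` up to floating-point error. With noisy correspondences, the same linear solve returns an algebraic least-squares estimate, which is commonly refined afterwards by minimising reprojection error.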