AirSplat: Alignment and Rating for Robust Feed-Forward 3D Gaussian Splatting

Minh-Quan Viet Bui, Jaeho Moon, Munchurl Kim

Abstract

While 3D Vision Foundation Models (3DVFMs) have demonstrated remarkable zero-shot capabilities in visual geometry estimation, their direct application to generalizable novel view synthesis (NVS) remains challenging. In this paper, we propose AirSplat, a novel training framework that effectively adapts the robust geometric priors of 3DVFMs into high-fidelity, pose-free NVS. Our approach introduces two key technical contributions: (1) Self-Consistent Pose Alignment (SCPA), a training-time feedback loop that ensures pixel-aligned supervision to resolve pose-geometry discrepancy; and (2) Rating-based Opacity Matching (ROM), which leverages the local 3D geometry consistency knowledge from a sparse-view NVS teacher model to filter out degraded primitives. Experimental results on large-scale benchmarks demonstrate that our method significantly outperforms state-of-the-art pose-free NVS approaches in reconstruction quality. Our AirSplat highlights the potential of adapting 3DVFMs to enable simultaneous visual geometry estimation and high-quality view synthesis.

Paper Structure

This paper contains 24 sections, 10 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Our proposed AirSplat adapts 3DVFMs using Self-Consistent Pose Alignment (SCPA) and Rating-based Opacity Matching (ROM) to resolve inherent pose-geometry discrepancies and multi-view inconsistencies. AirSplat effectively eliminates 'floaters' (red boxes) and blurry artifacts (dashed yellow boxes) produced by the baseline DA3 lin2025da3, rendering sharp, structurally consistent novel views.
  • Figure 2: Overview of our AirSplat training pipeline. While the Gaussian head of a 3DVFM is trained, the encoder and its pose head are frozen. Our SCPA module corrects the predicted target pose to align the rendered target views with the GT target views. Our ROM module gathers geometric feedback from the teacher model and uses it to enhance the multi-view consistency of the predicted 3D primitives.
  • Figure 3: Comparison of training paradigms in pose-free NVS. (a) Training with the context-only strategy, adopted in jiang2025anysplat, leads to a lack of direct supervision for novel viewpoints. (b) The context-target strategy, following huang2025no, results in spatial misalignment. (c) Our SCPA corrects the inherent spatial drift, enabling the network to learn both structurally consistent 3D geometry and robust novel view synthesis.
  • Figure 4: Effect of Self-Consistent Pose Alignment (SCPA). During training, we compare target views rendered with the initial predicted pose $\hat{\bm{P}}^{(1)}_{\text{tgt},t}$ and with our aligned pose $\hat{\bm{P}}_{\text{tgt},t}^\text{align}$ against the ground truth. As highlighted by the red arrows, the initial pose prediction results in a noticeable spatial shift, evident in the misaligned structural lines on the ground. Our proposed SCPA corrects this inherent spatial drift and ensures that the model learns structurally consistent 3D geometry and robust NVS. In the error maps, blue indicates small errors and red indicates large errors.
  • Figure 5: Qualitative comparison of NVS performance on RE10K dataset zhou2018stereo.
  • ...and 6 more figures
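
The spatial shift described in the Figure 3 and Figure 4 captions can be illustrated with a toy pinhole projection. The sketch below is not the authors' SCPA implementation: the intrinsics `K`, the three 3D points, and the 1-degree / 2 cm pose error are all hypothetical values chosen for illustration. It only shows why a small error in a predicted target pose moves the rendered content by several pixels relative to the ground-truth view, which is the misalignment that pixel-aligned supervision has to remove.

```python
import numpy as np

def project(points_world, K, R, t):
    """Project Nx3 world points to pixels using a world-to-camera pose (R, t)."""
    cam = points_world @ R.T + t          # world -> camera coordinates
    uv = cam @ K.T                        # apply pinhole intrinsics
    return uv[:, :2] / uv[:, 2:3]         # perspective divide

def rot_y(deg):
    """Rotation matrix for a rotation of `deg` degrees about the y axis."""
    a = np.deg2rad(deg)
    return np.array([[np.cos(a), 0.0, np.sin(a)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(a), 0.0, np.cos(a)]])

# Hypothetical intrinsics for a 640x480 image (assumption, not from the paper).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Hypothetical scene points, a few metres in front of the camera.
points = np.array([[0.0, 0.0, 4.0],
                   [0.5, -0.2, 5.0],
                   [-0.4, 0.3, 3.0]])

# Ground-truth target pose versus a slightly wrong predicted pose.
R_gt, t_gt = np.eye(3), np.zeros(3)
R_pred, t_pred = rot_y(1.0), np.array([0.02, 0.0, 0.0])

uv_gt = project(points, K, R_gt, t_gt)
uv_pred = project(points, K, R_pred, t_pred)

# Per-point pixel displacement caused by the pose error alone.
shift = np.linalg.norm(uv_pred - uv_gt, axis=1)
print(shift)
```

With these assumed values every point moves by roughly ten pixels, so a photometric loss against the ground-truth image would compare mismatched pixels unless the target pose is corrected first.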