GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views

Tianyu Chen, Wei Xiang, Kang Han, Yu Lu, Di Wu, Gaowen Liu, Ramana Rao Kompella

TL;DR

GIFSplat is a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. It consistently outperforms state-of-the-art feed-forward baselines, improving PSNR by up to +2.1 dB, while maintaining second-scale inference time without requiring camera poses or any test-time gradient optimization.

Abstract

Feed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and is often fragile under sparse views. However, existing feed-forward methods still leave room for performance gains, especially on out-of-domain data, and struggle to retain second-scale inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm of existing feed-forward pipelines: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited for continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refine the current 3D scene using rendering evidence, achieving a favorable balance between efficiency and quality. Furthermore, we distill a frozen diffusion prior into Gaussian-level cues from enhanced novel renderings without gradient backpropagation or ever-increasing view-set expansion, thereby enabling per-scene adaptation with a generative prior while preserving feed-forward efficiency. Across DL3DV, RealEstate10K, and DTU, GIFSplat consistently outperforms state-of-the-art feed-forward baselines, improving PSNR by up to +2.1 dB, and it maintains second-scale inference time without requiring camera poses or any test-time gradient optimization.

Paper Structure

This paper contains 25 sections, 9 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Conceptual comparison of reconstruction paradigms. Gradient-based optimization performs thousands of updates, incurring heavy test-time cost; it often achieves high quality in dense-view scenarios but struggles in sparse-view ones. One-shot feed-forward prediction [jiang2025anysplat] is efficient but leaves noticeable artifacts. Our iterative residual feed-forward scheme keeps feed-forward efficiency and achieves higher reconstruction quality without test-time gradient backpropagation.
  • Figure 2: Overview of GIFSplat. Our framework consists of a Gaussian initializer, an iterative Gaussian head, and a generative prior fusion module. The initializer takes sparse input views and predicts camera parameters and an initial 3DGS $g_0$. The iterative head then refines the Gaussians over several forward-only steps by updating the per-Gaussian parameters $g_i$ with residual corrections $\Delta g$ predicted from the concatenated state and cues $\{g_i, \mathbf{o}_i, \mathbf{p}_i\}$. At each step, we render reference and novel views, compute observation evidence $\mathbf{o}_i$ from feature differences between input and rendered views, and derive generative prior cues $\mathbf{p}_i$ by enhancing the renderings with a frozen diffusion model (DIFIX) and taking feature-space residuals. These Gaussian-level signals are fed back into the iterative head to progressively improve the 3DGS, particularly in under-constrained regions.
  • Figure 3: Visualizing iterative residual refinement. Starting from the initial 3DGS prediction (left column), our iterative Gaussian head progressively refines geometry and appearance over three forward-only steps (S1–S3). The zoomed regions highlighted by red dashed boxes show reduced blur, sharper edges, and fewer artifacts as the iteration proceeds, illustrating how the proposed updates gradually improve the scene representation without test-time gradient backpropagation.
  • Figure 4: Generative prior fusion. Given sparse input views (left), our feed-forward 3DGS first produces a rendered view (middle). A frozen diffusion-based enhancer then refines this rendering into an enhanced image (right) with sharper textures and richer details, which is converted into Gaussian-level prior cues for subsequent residual updates.
  • Figure 5: Qualitative comparisons on representative indoor scenes from RealEstate10K. Columns: sparse input views, FLARE, AnySplat, our GIFSplat with generative prior, and ground truth (GT). Red dashed boxes highlight that GIFSplat recovers sharper boundaries (e.g., door frames, wall corners), more faithful textures, and fewer artifacts such as texture sticking.
  • ...and 2 more figures
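
The refinement loop described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows the forward-only residual pattern, in which a state $g_i$ is updated by a correction $\Delta g$ predicted from $\{g_i, \mathbf{o}_i, \mathbf{p}_i\}$, with no gradient step at test time. All names here (`refine`, `toy_evidence`, `toy_prior`, `toy_head`, `TARGET`, `PRIOR_GUESS`) are hypothetical. The "Gaussians" are reduced to a short list of floats, the rendering and feature extraction are collapsed to the identity, and the learned head is replaced by fixed weights.

```python
from typing import Callable, List

Vector = List[float]


def refine(
    g0: Vector,
    evidence_fn: Callable[[Vector], Vector],
    prior_fn: Callable[[Vector], Vector],
    head_fn: Callable[[Vector, Vector, Vector], Vector],
    num_steps: int = 3,
) -> List[Vector]:
    """Forward-only residual refinement: g_{i+1} = g_i + delta_g.

    Returns every intermediate state, starting with the initial one.
    No gradients are computed anywhere in this loop.
    """
    g = list(g0)
    states = [list(g)]
    for _ in range(num_steps):
        o = evidence_fn(g)        # observation evidence o_i
        p = prior_fn(g)           # generative prior cues p_i
        delta = head_fn(g, o, p)  # residual correction from {g_i, o_i, p_i}
        g = [gi + di for gi, di in zip(g, delta)]
        states.append(list(g))
    return states


# --- Toy stand-ins (assumptions for illustration only) ---------------------
TARGET = [1.0, -2.0, 0.5]        # plays the role of the observed input views
PRIOR_GUESS = [0.9, -1.8, 0.6]   # plays the role of the enhancer's output


def toy_evidence(g: Vector) -> Vector:
    # Residual between observation and "rendering" (identity render here).
    return [t - x for t, x in zip(TARGET, g)]


def toy_prior(g: Vector) -> Vector:
    # Residual between the "enhanced rendering" and the current rendering.
    return [q - x for q, x in zip(PRIOR_GUESS, g)]


def toy_head(g: Vector, o: Vector, p: Vector) -> Vector:
    # Fixed-weight stand-in for a learned head; a real head would be a network.
    return [0.4 * oi + 0.2 * pi for oi, pi in zip(o, p)]


if __name__ == "__main__":
    for step, state in enumerate(refine([0.0, 0.0, 0.0], toy_evidence, toy_prior, toy_head)):
        err = sum(abs(t - x) for t, x in zip(TARGET, state))
        print(f"step {step}: L1 error to target = {err:.4f}")
```

In this toy setting the error to the target shrinks at every step because each update moves the state part of the way toward a blend of the observation and the prior guess. How well a trained head behaves on real scenes is an empirical question addressed by the paper's experiments, not by this sketch.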