PreF3R: Pose-Free Feed-Forward 3D Gaussian Splatting from Variable-length Image Sequence

Zequn Chen, Jiezhi Yang, Heng Yang

TL;DR

PreF3R is a pose-free, feed-forward method that reconstructs a global 3D Gaussian field, expressed in a canonical frame, from variable-length sequences of unposed images. It extends a pairwise reconstruction model with a spatial memory network to handle multi-view inputs without global optimization, and adds a dense Gaussian parameter head for differentiable rasterization, enabling real-time novel-view synthesis. The approach reconstructs incrementally at about 20 FPS (reported on a single H100 GPU) and generalizes to unseen scenes. The result is an end-to-end pipeline for real-time 3D content creation from unposed data, with rendering quality competitive with methods that require ground-truth poses or per-scene optimization.

Abstract

We present PreF3R, Pose-Free Feed-forward 3D Reconstruction from an image sequence of variable length. Unlike previous approaches, PreF3R removes the need for camera calibration and reconstructs the 3D Gaussian field within a canonical coordinate frame directly from a sequence of unposed images, enabling efficient novel-view rendering. We leverage DUSt3R's ability for pair-wise 3D structure reconstruction, and extend it to sequential multi-view input via a spatial memory network, eliminating the need for optimization-based global alignment. Additionally, PreF3R incorporates a dense Gaussian parameter prediction head, which enables subsequent novel-view synthesis with differentiable rasterization. This allows supervising our model with the combination of photometric loss and pointmap regression loss, enhancing both photorealism and structural accuracy. Given a sequence of ordered images, PreF3R incrementally reconstructs the 3D Gaussian field at 20 FPS, therefore enabling real-time novel-view rendering. Empirical experiments demonstrate that PreF3R is an effective solution for the challenging task of pose-free feed-forward novel-view synthesis, while also exhibiting robust generalization to unseen scenes.
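The combined supervision described in the abstract can be sketched as a weighted sum of a photometric term and a pointmap regression term. The following is a minimal illustration only: the function names, the use of mean squared error for the photometric term, the mean Euclidean distance for the pointmap term, and the weights `w_photo` and `w_point` are all assumptions, not the losses or weights used by the authors.

```python
from math import sqrt


def photometric_loss(rendered, target):
    """Mean squared error between two images given as flat lists of intensities.

    Assumption: MSE stands in for whatever photometric term the paper uses.
    """
    return sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(target)


def pointmap_loss(pred_points, gt_points):
    """Mean Euclidean distance between predicted and ground-truth 3D points.

    Assumption: a plain L2 regression stands in for the paper's pointmap loss.
    """
    dists = [
        sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)))
        for pred, gt in zip(pred_points, gt_points)
    ]
    return sum(dists) / len(dists)


def total_loss(rendered, target, pred_points, gt_points, w_photo=1.0, w_point=1.0):
    """Weighted sum of both terms; the weights are hypothetical."""
    return (w_photo * photometric_loss(rendered, target)
            + w_point * pointmap_loss(pred_points, gt_points))


# Toy example: a 2-pixel image and a 2-point pointmap.
rendered = [0.5, 0.5]
target = [0.5, 1.0]
pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
loss = total_loss(rendered, target, pred_points, gt_points)
```

In this toy setup the photometric term pushes the rendered pixels toward the target image, while the pointmap term anchors the predicted geometry, which is the division of labor the abstract attributes to the two losses.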

Paper Structure

This paper contains 36 sections, 8 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of PreF3R. Given a sequence of unposed images of variable length, PreF3R incrementally reconstructs a set of 3D Gaussian primitives in a single feed-forward pass without any pre-processing or intermediate pose estimation. PreF3R operates at 20 FPS on a single H100 GPU, enabling real-time novel-view synthesis from numerous input images through differentiable rasterization.
  • Figure 2: PreF3R's overall architecture. Left: An ordered set of unposed images $\{I_t\}_{t=1}^T$ is fed into PreF3R sequentially. Middle: At timestamp $t$, the input frame $I_t$ is first encoded by a ViT-encoder into $f_t$, which is then decoded into the query feature $f_t^q$ by the Target Decoder. The Target Decoder is intertwined with the Reference Decoder through cross-attention. Simultaneously, the query feature of the previous frame $f_{t-1}^q$ queries the memory bank to produce the fused feature $f_{t-1}^g$, which the Reference Decoder decodes into the output feature $f_{t-1}^h$. $f_{t-1}^h$ is then processed by the Gaussian Head and the Point Head to produce pixel-aligned Gaussian primitives. Right: The output from each frame is accumulated into global Gaussian primitives, enabling fast novel-view synthesis through rasterization.
  • Figure 3: Scale ambiguity problem. Even slight scale shifts can cause significant view drifts in rendered results from ground-truth camera poses, making it hard to apply photometric supervision. Top row: ground-truth images; Bottom row: rendered images. Data sample is from Co3D reizenstein2021common.
  • Figure 4: Qualitative comparison of novel view synthesis performance. Left: visualization of scene reconstructions from ARKitScenes baruch1arkitscenes; Right: visualization of reconstructions from ScanNet++ yeshwanthliu2023scannetpp. Each row corresponds to a unique viewpoint, while each column displays the output of a different method. Note that MVSplat chen2024mvsplat relies on ground truth poses, and InstantSplat fan2024instantsplatsparseviewsfmfreegaussian requires per-scene optimization, whereas PreF3R requires neither. PreF3R achieves comparable or superior photorealism and demonstrates better structural accuracy relative to the other methods.
  • Figure 5: PreF3R performs incremental Gaussian reconstruction in real-time. Left: in-domain scene reconstruction from ScanNet++ yeshwanthliu2023scannetpp; Right: out-of-domain scene reconstruction from Tanks and Temples Knapitsch2017.
  • ...and 2 more figures
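
The scale ambiguity problem in the Figure 3 caption can be illustrated with a generic pinhole projection. The sketch below is not the paper's setup: the focal length, principal point, 3D point, camera position, and the 10% scale error are hypothetical values, and the camera is assumed to have identity rotation. It shows that uniformly rescaling the reconstruction leaves the image unchanged for a camera at the origin of the reconstruction frame, but shifts the projection for a camera placed at an unscaled ground-truth position.

```python
def project_u(point, cam_center, focal=500.0, cx=320.0):
    """Horizontal pixel coordinate of a 3D point seen by a pinhole camera.

    Assumptions: identity rotation, hypothetical focal length and principal point.
    """
    x = point[0] - cam_center[0]
    z = point[2] - cam_center[2]
    return focal * x / z + cx


def scale_point(point, s):
    """Uniformly scale a 3D point about the origin."""
    return tuple(s * c for c in point)


point = (0.5, 0.0, 2.0)      # hypothetical scene point
scaled = scale_point(point, 1.1)  # same point with a 10% scale error

origin_cam = (0.0, 0.0, 0.0)  # camera at the origin of the reconstruction frame
other_cam = (0.2, 0.0, 0.0)   # hypothetical ground-truth camera position

# From the origin, scaling moves the point along its viewing ray: no pixel shift.
shift_at_origin = project_u(scaled, origin_cam) - project_u(point, origin_cam)

# From the other pose, the same scale error moves the projection by several pixels.
shift_at_other = project_u(scaled, other_cam) - project_u(point, other_cam)
```

Here `shift_at_origin` is zero up to floating-point error, while `shift_at_other` is roughly 4.5 pixels, which is consistent with the caption's point that small scale shifts produce view drift when rendering from ground-truth poses.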