No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views

Ranran Huang, Krystian Mikolajczyk

TL;DR

SPFSplat tackles sparse-view novel view synthesis without ground-truth camera poses by jointly predicting 3D Gaussian primitives and relative poses in a canonical space using a shared ViT backbone. It introduces a rendering loss plus a pixel-wise reprojection constraint to stabilize training and enhance geometric alignment, enabling end-to-end, feed-forward pose-free learning. The approach achieves state-of-the-art NVS performance, strong pose estimation, and robust zero-shot cross-dataset generalization, while maintaining real-time inference speeds. This makes pose-free 3D Gaussian splatting practical for scalable real-world applications without pose annotations.

Abstract

We introduce SPFSplat, an efficient framework for 3D Gaussian splatting from sparse multi-view images, requiring no ground-truth poses during training or inference. It employs a shared feature extraction backbone, enabling simultaneous prediction of 3D Gaussian primitives and camera poses in a canonical space from unposed inputs within a single feed-forward step. Alongside the rendering loss based on estimated novel-view poses, a reprojection loss is integrated to enforce the learning of pixel-aligned Gaussian primitives for enhanced geometric constraints. This pose-free training paradigm and efficient one-step feed-forward design make SPFSplat well-suited for practical applications. Remarkably, despite the absence of pose supervision, SPFSplat achieves state-of-the-art performance in novel view synthesis even under significant viewpoint changes and limited image overlap. It also surpasses recent methods trained with geometry priors in relative pose estimation. Code and trained models are available on our project page: https://ranrhuang.github.io/spfsplat/.
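The reprojection loss mentioned above can be illustrated with a small sketch. The exact formulation is not given in this section, so the code below is only one plausible reading: it assumes a pinhole camera, a pose expressed as a rotation and translation that map canonical-space points into the view's camera frame, and a mean Euclidean pixel distance as the penalty. The function names `project_point` and `reprojection_loss` and all parameter names are hypothetical and are not taken from the SPFSplat code.

```python
import math


def project_point(center, rotation, translation, fx, fy, cx, cy):
    """Project a canonical-space 3D point into pixel coordinates of one view.

    Assumption: `rotation` (3x3 nested lists) and `translation` (length 3)
    map canonical-space points into the camera frame, x_cam = R @ x + t.
    """
    x_cam = [
        sum(rotation[i][j] * center[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    depth = x_cam[2]
    if depth <= 0:
        raise ValueError("point lies behind the camera")
    # Pinhole projection with focal lengths (fx, fy) and principal point (cx, cy).
    u = fx * x_cam[0] / depth + cx
    v = fy * x_cam[1] / depth + cy
    return u, v


def reprojection_loss(centers, pixels, rotation, translation, fx, fy, cx, cy):
    """Mean pixel distance between projected Gaussian centers and the pixels
    they were predicted from (an assumed form of the penalty)."""
    total = 0.0
    for center, (pu, pv) in zip(centers, pixels):
        u, v = project_point(center, rotation, translation, fx, fy, cx, cy)
        total += math.hypot(u - pu, v - pv)
    return total / len(centers)


# Toy example: two Gaussian centers and the pixels they belong to.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
centers = [(0.0, 0.0, 2.0), (0.2, 0.0, 2.0)]
pixels = [(50.0, 50.0), (60.0, 50.0)]

# With the correct pose, every center lands on its own pixel.
aligned = reprojection_loss(centers, pixels, identity, [0.0, 0.0, 0.0],
                            100.0, 100.0, 50.0, 50.0)
# A small error in the estimated translation shifts every projection.
shifted = reprojection_loss(centers, pixels, identity, [0.02, 0.0, 0.0],
                            100.0, 100.0, 50.0, 50.0)
```

In this toy setup the loss is zero when the pose is exact and grows once the estimated translation drifts. That is the sense in which such a term ties the predicted Gaussian centers and the estimated poses to the same pixels.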

Paper Structure

This paper contains 14 sections, 6 equations, 15 figures, 12 tables.

Figures (15)

  • Figure 1: Comparison of three training pipelines for sparse-view 3D scene reconstruction in novel view synthesis. For simplicity, the image rendering loss on the rendered target view is omitted. Our self-supervised pose-free pipeline estimates target-view poses to optimize 3D scene representations reconstructed from unposed images, thereby eliminating the reliance on ground-truth poses during training.
  • Figure 2: Training pipeline of SPFSplat. Three specialized heads are integrated into a shared ViT backbone, simultaneously predicting Gaussian centers, additional Gaussian parameters, and camera poses from unposed images in a canonical space, where the first input view serves as the reference. Only the context-only branch (above) is used during inference, while the context-with-target branch (below) is employed exclusively during training to estimate target poses, which are used for rendering loss supervision. Additionally, a reprojection loss enforces alignment between Gaussian centers and their corresponding pixels, using the estimated context poses from both branches. Our method jointly optimizes 3D Gaussians and poses, improving geometric consistency and reconstruction quality.
  • Figure 3: Qualitative comparison on RE10K (top three rows) and ACID (bottom row). Compared to baselines, our method 1) reduces misaligned blending artifacts and ghosting effects, 2) better handles extreme viewpoint changes and texture-less areas (e.g. window), and 3) preserves overall scene geometry (e.g. bridge) and finer details (e.g. swimming pool).
  • Figure 4: Cross-dataset generalization. Some failure regions are highlighted by red rectangles for visual reference.
  • Figure 5: Comparison of 3D Gaussians and rendered results. Input and target camera poses are shown in red and green, respectively. Rendered images and depth maps are displayed on the right. Our method produces higher-quality 3D Gaussians and achieves superior rendering compared to baseline methods. Some regions with distorted or incorrect geometry are highlighted with red arrows.
  • ...and 10 more figures