From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis

Ranran Huang, Weixun Luo, Ye Mao, Krystian Mikolajczyk

Abstract

In this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors. During training, NAS3R reconstructs 3D Gaussians from uncalibrated and unposed context views and renders target views using its self-predicted camera parameters, enabling self-supervised training from 2D photometric supervision. To ensure stable convergence, NAS3R integrates reconstruction and camera prediction within a shared transformer backbone regulated by masked attention, and adopts a depth-based Gaussian formulation that facilitates well-conditioned optimization. The framework is compatible with state-of-the-art supervised 3D reconstruction architectures and can incorporate pretrained priors or intrinsic information when available. Extensive experiments show that NAS3R achieves superior results to other self-supervised methods, establishing a scalable and geometry-aware paradigm for 3D reconstruction from unconstrained data. Code and models are publicly available at https://ranrhuang.github.io/nas3r/.
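The training signal described above can be summarized in a short sketch. The snippet below is illustrative only: the module names (`reconstruct_gaussians`, `predict_cameras`, `render`) and the plain MSE photometric term are assumptions standing in for NAS3R's actual components, which the abstract does not spell out.

```python
# Minimal sketch of the self-supervised training step described in the abstract.
# All model method names below are hypothetical placeholders, not NAS3R's API.
import torch.nn.functional as F

def self_supervised_step(model, context_views, target_views):
    """One training step driven purely by 2D photometric supervision (no GT labels)."""
    # Predict explicit 3D Gaussians and camera parameters from unposed,
    # uncalibrated context views.
    gaussians = model.reconstruct_gaussians(context_views)   # hypothetical
    target_cams = model.predict_cameras(target_views)        # hypothetical

    # Render the target viewpoints from the predicted Gaussians using the
    # self-predicted camera parameters.
    rendered = model.render(gaussians, target_cams)           # hypothetical

    # Compare renderings against the observed target images; a plain MSE is
    # shown here, though the paper may combine several image-space terms.
    return F.mse_loss(rendered, target_views)
```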

Paper Structure

This paper contains 16 sections, 5 equations, 10 figures, and 13 tables.

Figures (10)

  • Figure 1: NAS3R is a self-supervised framework that requires no ground-truth annotations and no pretrained priors during training. It jointly infers 3D Gaussian parameters, camera intrinsics and extrinsics, and depth maps, while also enabling high-quality novel view synthesis.
  • Figure 2: Training pipeline of NAS3R. Subscripts "$C$" and "$T$" denote context and target views, respectively. Unconstrained images are patchified into visual tokens and concatenated with a learnable camera token for camera prediction. A masked decoder regulates cross-view interactions and prevents target-to-context leakage. Refined context tokens are then processed by the Gaussian head to predict Gaussian parameters, while a depth head estimates depth maps that are lifted into 3D Gaussian centers using the predicted context poses (a sketch of this lifting step follows the figure list). The predicted target poses are finally used to render novel views, providing photometric supervision for end-to-end training.
  • Figure 3: Comparison of NVS results across different methods. The leftmost column shows the two-view context images. From top to bottom, the settings are RE10K, RE10K$\rightarrow$ACID, RE10K$\rightarrow$DTU, and RE10K$\rightarrow$DL3DV.
  • Figure 4: Visual comparison of pose trajectories on RE10K. Camera frustums for ground-truth and predicted poses are shown in black and orange, respectively. The top two examples correspond to 5-view reconstruction, while the bottom example corresponds to 10-view reconstruction.
  • Figure 5: Comparison of two-view depth estimation results on the BlendedMVS dataset.
  • ...and 5 more figures
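As referenced in the Figure 2 caption, the depth head's output is lifted into 3D Gaussian centers using the predicted context poses. The following is a minimal sketch of that unprojection step; the function name, the camera-to-world pose convention, and the half-pixel offset are assumptions, not details fixed by the captions above.

```python
# Minimal sketch of lifting a predicted depth map into 3D Gaussian centers,
# as described in the Figure 2 caption. Conventions (camera-to-world pose,
# pixel-center offset) are assumptions rather than details from the paper.
import torch

def lift_depth_to_centers(depth, K, cam_to_world):
    """Unproject an (H, W) depth map into world-space Gaussian centers.

    depth:        (H, W) predicted depth for one context view
    K:            (3, 3) predicted camera intrinsics
    cam_to_world: (4, 4) predicted context-view pose (camera-to-world)
    """
    H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype),
        torch.arange(W, dtype=depth.dtype),
        indexing="ij",
    )
    # Back-project pixel coordinates through the inverse intrinsics and
    # scale by the predicted depth to obtain camera-frame points.
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)  # (H, W, 3)
    rays = pix @ torch.linalg.inv(K).T
    pts_cam = rays * depth.unsqueeze(-1)

    # Move camera-frame points into world coordinates with the predicted pose;
    # each point becomes the center of one 3D Gaussian.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t  # (H, W, 3)
```

In the full pipeline, the remaining Gaussian attributes (e.g. opacity, scale, rotation, color) would come from the Gaussian head; this sketch only covers the placement of the centers.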