Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment

Youming Deng, Songyou Peng, Junyi Zhang, Kathryn Heal, Tiancheng Sun, John Flynn, Steve Marschner, Lucy Chai

TL;DR

Selfi is a self-improving reconstruction pipeline that turns a vision foundation model backbone (VGGT) into a high-fidelity 3D renderer by learning geometrically aligned features from unposed images, using the backbone's own predictions as pseudo-ground-truth. A reprojection-based alignment loss shapes a geometry-aware feature space, which enables state-of-the-art novel view synthesis (NVS) and pose estimation without 3D supervision. A lightweight Gaussian head predicts 3D primitives from these features, and bundle adjustment (BA) combined with a depth shift further refines poses and geometry. Key innovations include feature alignment over VGGT outputs, density-aware spherical harmonics for robust rendering, and an affine depth shift that keeps the geometry consistent after pose updates. The method performs strongly on RealEstate10K and DL3DV, indicating that self-supervised geometric alignment can substantially improve downstream 3D reasoning on unposed data, with practical implications for robust, calibration-free NVS.

Abstract

Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases, combined with known camera parameters obtained beforehand from Structure-from-Motion (SfM). Recent vision foundation models like VGGT take an orthogonal approach: 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter using a reprojection-based consistency loss, which distills VGGT outputs into a new geometrically aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning.
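
The reprojection-based consistency idea can be illustrated with a minimal sketch for a single query pixel. It is not the paper's implementation: the shared pinhole intrinsics, the helper names `reproject` and `cosine_alignment_loss`, and the choice of a cosine-distance penalty are all assumptions made for illustration. The source only states that features at corresponding locations are encouraged to be similar.

```python
import math


def reproject(u, v, depth, intrinsics, src_to_tgt):
    """Map a source pixel (u, v) with known depth to target-view pixel coordinates.

    intrinsics: (fx, fy, cx, cy), assumed identical for both views (illustrative).
    src_to_tgt: 3x4 row-major [R | t], source-camera to target-camera transform.
    Returns None when the point lies behind the target camera.
    """
    fx, fy, cx, cy = intrinsics
    # Unproject the pixel to a 3D point in the source camera frame.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Move the point into the target camera frame.
    q = [row[0] * point[0] + row[1] * point[1] + row[2] * point[2] + row[3]
         for row in src_to_tgt]
    if q[2] <= 0.0:
        return None
    # Project onto the target image plane.
    return (fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy)


def cosine_alignment_loss(feat_src, feat_tgt):
    """Assumed penalty: 1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(feat_src, feat_tgt))
    norm = (math.sqrt(sum(a * a for a in feat_src))
            * math.sqrt(sum(b * b for b in feat_tgt)))
    return 1.0 - dot / norm
```

In a full pipeline, the depth and relative pose would come from the backbone's own predictions (the pseudo-ground-truth), and the two feature vectors would be sampled from the adapter's feature maps at the query pixel and at its reprojected location in the target view.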

Paper Structure

This paper contains 19 sections, 10 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Self Improving Reconstruction Engine. We introduce Selfi, a self-improving pipeline for novel view synthesis from unposed images. We start by learning geometrically aligned features using consistency losses and self-labelled pseudo ground truths from a 3D foundation model (e.g., VGGT [wang2025vggt]). These features can be used to predict Gaussian primitives [kerbl20233d], and also refine initial poses via bundle adjustment. The improved poses are used to further adjust the initial 3D representation, resulting in an even higher quality final rendering.
  • Figure 2: Geometric Feature Alignment with Self-Labeled Pseudo-Ground-Truth. Using a pretrained VGGT [wang2025vggt] backbone, we use predicted depth and camera parameters as pseudo-ground-truth to align features obtained from a DPT adapter on top of VGGT image tokens. We sample query points and reproject these points to a target view using depth and camera parameters. Our loss function encourages the features at these two corresponding locations from source and target frames to be similar.
  • Figure 3: Qualitative Comparisons on DL3DV [ling2024dl3dv]. We visualize novel view renderings from AnySplat [jiang2025anysplat], WorldMirror [liu2025worldmirror], and our method. Our method successfully recovers thin structures, such as guardrails, and fine-grained details, such as the text "Holidays".
  • Figure 4: Bundle Adjustment with Depth Shift. (a) After refining the camera poses with bundle adjustment, naively rendering the predicted Gaussian primitives with the new poses results in misalignment. (b) Propagating the adjustments in sparse 3D points during BA to the dense depth maps results in improved rendering. (c) We plot the sparse point depths before and after BA, and observe that a linear fit suffices for this adjustment.
  • Figure 5: Qualitative Comparisons on RealEstate10K [zhou2018stereo]. We visualize novel view renderings from AnySplat [jiang2025anysplat], WorldMirror [liu2025worldmirror], and our model. Our method more faithfully reconstructs details such as the door hinge and the tiled wall.
  • ...and 7 more figures
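
The depth shift described in the Figure 4 caption can be sketched as an ordinary least-squares line fit between sparse point depths before and after bundle adjustment, which is then applied to every pixel of the dense depth map. The function names `fit_affine_depth` and `apply_depth_shift` are hypothetical, and closed-form least squares is an assumed fitting procedure; the source only states that a linear fit suffices.

```python
def fit_affine_depth(depth_before, depth_after):
    """Least-squares fit of depth_after ~ scale * depth_before + shift.

    Both inputs are equal-length sequences of sparse point depths,
    taken before and after bundle adjustment.
    """
    n = len(depth_before)
    mean_x = sum(depth_before) / n
    mean_y = sum(depth_after) / n
    var_x = sum((x - mean_x) ** 2 for x in depth_before)
    if var_x == 0.0:
        raise ValueError("need at least two distinct depths to fit a line")
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(depth_before, depth_after))
    scale = cov_xy / var_x
    shift = mean_y - scale * mean_x
    return scale, shift


def apply_depth_shift(depth_map, scale, shift):
    """Apply the fitted affine correction to a dense depth map (list of rows)."""
    return [[scale * d + shift for d in row] for row in depth_map]
```

Fitting only two parameters per view keeps the correction cheap and lets the sparse BA adjustments propagate to every pixel, so that the Gaussian primitives derived from the dense depth stay aligned with the refined poses.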