Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment
Youming Deng, Songyou Peng, Junyi Zhang, Kathryn Heal, Tiancheng Sun, John Flynn, Steve Marschner, Lucy Chai
TL;DR
Selfi is a self-improving reconstruction pipeline that turns a vision foundation model backbone (VGGT) into a high-fidelity 3D renderer by learning geometrically aligned features from unposed images, using the backbone's own outputs as pseudo-ground-truth. A reprojection-based alignment loss shapes a geometry-aware feature space, enabling state-of-the-art novel view synthesis (NVS) and camera pose estimation without 3D supervision. A lightweight Gaussian head predicts 3D primitives, and bundle adjustment (BA) with a depth shift further refines poses and geometry. Key innovations include feature alignment over VGGT outputs, density-aware spherical harmonics for robust rendering, and an affine depth shift that keeps geometry consistent after pose updates. Results on RealEstate10K and DL3DV show strong performance, indicating that self-supervised geometric alignment can substantially improve downstream 3D reasoning on unposed data, with practical implications for robust, calibration-free NVS.
Abstract
Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases, combined with camera parameters estimated beforehand via Structure-from-Motion (SfM). Recent vision foundation models like VGGT take an orthogonal approach: 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, which transforms a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter using a reprojection-based consistency loss, which distills VGGT outputs into a new geometrically aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning.
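
The abstract does not spell out the reprojection-based consistency loss, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes a shared pinhole intrinsic matrix, nearest-neighbor sampling, and a cosine-distance penalty; the function names `reproject` and `alignment_loss` are hypothetical. In a Selfi-style setup, the depth map and relative pose would come from the backbone's own predictions (the pseudo-ground-truth), and the features would come from the adapter being trained.

```python
import numpy as np

def reproject(depth_i, K, R_ij, t_ij):
    """Map every pixel of view i into view j (assumed shared pinhole intrinsics K)."""
    H, W = depth_i.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)
    rays = pix @ np.linalg.inv(K).T        # back-project pixels to camera rays
    pts_i = rays * depth_i[..., None]      # 3D points in camera i
    pts_j = pts_i @ R_ij.T + t_ij          # same points expressed in camera j
    proj = pts_j @ K.T                     # perspective projection into view j
    z = proj[..., 2]
    uv = proj[..., :2] / np.maximum(z[..., None], 1e-8)
    return uv, z

def alignment_loss(feat_i, feat_j, depth_i, K, R_ij, t_ij):
    """Mean cosine distance between features of pixels that reproject onto each other."""
    H, W, _ = feat_i.shape
    uv, z = reproject(depth_i, K, R_ij, t_ij)
    u_f, v_f = np.rint(uv[..., 0]), np.rint(uv[..., 1])
    # Keep only points in front of camera j that land inside its image.
    valid = (z > 0) & (u_f >= 0) & (u_f <= W - 1) & (v_f >= 0) & (v_f <= H - 1)
    if not valid.any():
        return 0.0
    u = u_f[valid].astype(int)
    v = v_f[valid].astype(int)
    f_i = feat_i[valid]                    # (N, C) features in view i
    f_j = feat_j[v, u]                     # (N, C) nearest-neighbor samples in view j
    f_i = f_i / (np.linalg.norm(f_i, axis=-1, keepdims=True) + 1e-8)
    f_j = f_j / (np.linalg.norm(f_j, axis=-1, keepdims=True) + 1e-8)
    return float(np.mean(1.0 - np.sum(f_i * f_j, axis=-1)))
```

With an identity relative pose and identical feature maps, every pixel reprojects onto itself and the loss is approximately zero. Features that disagree at corresponding 3D points raise it, which is the signal a feature adapter could be trained to minimize. A practical version would likely use bilinear sampling and an occlusion check, both omitted here for brevity.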
