YoNoSplat: You Only Need One Model for Feedforward 3D Gaussian Splatting

Botao Ye, Boqi Chen, Haofei Xu, Daniel Barath, Marc Pollefeys

TL;DR

YoNoSplat is a feedforward model for fast, flexible 3D scene reconstruction from unposed image collections: it predicts per-view local Gaussians and camera poses, then aggregates them into a global scene. Its mix-forcing training mitigates pose–geometry entanglement, enabling robust performance in both pose-free and pose-dependent settings. An Intrinsic Condition Embedding (ICE) and max pairwise distance normalization resolve scale ambiguity, and predicting the intrinsics permits uncalibrated inputs. The approach reports state-of-the-art results on standard benchmarks, generalizes across datasets, and reconstructs a 100-view scene (at 280x518 resolution) in 2.69 seconds on an NVIDIA GH200 GPU, with practical gains from optional post-optimization.
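To make the max pairwise distance normalization concrete, here is a minimal NumPy sketch. It assumes the normalization divides camera positions by the largest distance between any two cameras; the function name, the `eps` guard, and the choice to work on camera centers are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def normalize_by_max_pairwise_distance(cam_centers, eps=1e-8):
    """Rescale camera centers so the largest pairwise distance is 1.

    cam_centers: (N, 3) camera positions in a shared world frame.
    Returns the rescaled centers and the scale that was divided out.
    NOTE: hypothetical helper; the paper's exact procedure may differ.
    """
    centers = np.asarray(cam_centers, dtype=np.float64)
    # (N, N, 3) differences between every pair of camera centers.
    diffs = centers[:, None, :] - centers[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Guard against a degenerate scene where all cameras coincide.
    scale = max(float(dists.max()), eps)
    return centers / scale, scale


centers = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
normalized, scale = normalize_by_max_pairwise_distance(centers)
```

In a full pipeline the same scale would presumably also be applied to the scene geometry so that cameras and Gaussians stay consistent; that step is omitted here.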

Abstract

Fast and flexible 3D scene reconstruction from unstructured image collections remains a significant challenge. We present YoNoSplat, a feedforward model that reconstructs high-quality 3D Gaussian Splatting representations from an arbitrary number of images. Our model is highly versatile, operating effectively with both posed and unposed, calibrated and uncalibrated inputs. YoNoSplat predicts local Gaussians and camera poses for each view, which are aggregated into a global representation using either predicted or provided poses. To overcome the inherent difficulty of jointly learning 3D Gaussians and camera parameters, we introduce a novel mixing training strategy. This approach mitigates the entanglement between the two tasks by initially using ground-truth poses to aggregate local Gaussians and gradually transitioning to a mix of predicted and ground-truth poses, which prevents both training instability and exposure bias. We further resolve the scale ambiguity problem by a novel pairwise camera-distance normalization scheme and by embedding camera intrinsics into the network. Moreover, YoNoSplat also predicts intrinsic parameters, making it feasible for uncalibrated inputs. YoNoSplat demonstrates exceptional efficiency, reconstructing a scene from 100 views (at 280x518 resolution) in just 2.69 seconds on an NVIDIA GH200 GPU. It achieves state-of-the-art performance on standard benchmarks in both pose-free and pose-dependent settings. Our project page is at https://botaoye.github.io/yonosplat/.
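The mixing schedule described in the abstract can be sketched as follows. This is only one plausible reading: training starts with ground-truth poses only, then the chance of aggregating with predicted poses ramps up linearly to a fixed mixing ratio. The function names and the values of `warmup_steps`, `ramp_steps`, and `max_ratio` are hypothetical and not taken from the paper.

```python
import random


def predicted_pose_probability(step, warmup_steps=1000, ramp_steps=4000, max_ratio=0.5):
    """Probability of aggregating with predicted (rather than ground-truth) poses.

    Hypothetical schedule: 0 during warmup, then a linear ramp up to max_ratio.
    """
    if step < warmup_steps:
        return 0.0
    progress = min((step - warmup_steps) / ramp_steps, 1.0)
    return max_ratio * progress


def choose_pose_source(step, rng=random, **schedule):
    """Sample which poses are used to aggregate local Gaussians at this step."""
    p = predicted_pose_probability(step, **schedule)
    return "predicted" if rng.random() < p else "ground_truth"


# Early training always uses ground-truth poses; later steps mix both sources.
early = choose_pose_source(0)
late_probability = predicted_pose_probability(10_000)
```

Keeping `max_ratio` below 1 means ground-truth poses are still used throughout training, which matches the abstract's "mix of predicted and ground-truth poses"; the actual ratio and schedule shape are not stated in this section.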

Paper Structure

This paper contains 22 sections, 3 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: YoNoSplat, a versatile feedforward model for rapid 3D reconstruction. Given an arbitrary number of unposed and uncalibrated input images covering a wide range of scenes, it predicts 3D Gaussians and can also utilize ground-truth camera poses or intrinsics when available.
  • Figure 2: Effect of different global Gaussian aggregation strategies during training. (a) Aggregating global Gaussians with predicted camera poses results in poor rendering quality because errors in pose estimation and Gaussian learning compound each other. (b) Using ground-truth poses introduces exposure bias (as indicated by the green arrow: training with ground-truth poses but testing with predicted poses causes misalignment of local Gaussians across different views). (c) Our mix-forcing training achieves high rendering quality in both pose-free and pose-dependent settings.
  • Figure 3: Overview of YoNoSplat. (a) Features are extracted with a DINOv2 encoder, followed by local-global attention across images, and finally used to predict camera poses and local 3D Gaussians. (b) The Intrinsic Condition Embedding (ICE) module predicts intrinsic parameters (i.e., focal length), which are then converted into camera rays and re-encoded as conditioning for Gaussian prediction, thereby resolving scale ambiguity.
  • Figure 4: Qualitative comparison on DL3DV. Here we present our results in the pose-free, calibration-free setting, which still produce higher-quality novel view renderings compared to the pose-dependent method DepthSplat.
  • Figure 5: Qualitative comparison on RealEstate10K. Our pose-free, calibration-free method enables a more coherent fusion of multi-view contents.
  • ...and 4 more figures
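The Figure 3(b) caption mentions converting the predicted focal length into camera rays. A minimal sketch of that conversion follows. It assumes a pinhole camera with square pixels and a principal point at the image center; the function name and the example focal length are illustrative assumptions, and how YoNoSplat re-encodes the rays as conditioning is not shown here.

```python
import numpy as np


def camera_rays_from_focal(height, width, focal_px):
    """Unit ray direction for every pixel of a pinhole camera.

    Assumes square pixels and a principal point at the image center
    (an assumption made for this sketch, not a detail from the paper).
    Returns an array of shape (height, width, 3) in camera coordinates.
    """
    cx, cy = width / 2.0, height / 2.0
    # Pixel-center coordinates; default 'xy' indexing yields (height, width) grids.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    dirs = np.stack(
        [(u - cx) / focal_px, (v - cy) / focal_px, np.ones_like(u)], axis=-1
    )
    # Normalize so that each ray has unit length.
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)


# Example at the 280x518 resolution quoted in the abstract; the focal length is made up.
rays = camera_rays_from_focal(280, 518, focal_px=400.0)
```

A larger focal length narrows the fan of rays, so the ray map gives the network an explicit signal about the field of view.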