Beyond Existance: Fulfill 3D Reconstructed Scenes with Pseudo Details
Yifei Gao, Jun Huang, Lei Wang, Ruiting Dai, Jun Cheng
TL;DR
This work tackles zoom-in artifacts in 3D Gaussian Splatting (3D-GS) that arise from gaps in training-view sampling. It introduces a hybrid training pipeline that couples Bootstrap-GS with upscaling diffusion models (Latent Diffusion Models) to generate high-frequency, scene-consistent pseudo-ground-truth, combined with a frequency-aware interpolation and filtering strategy. The final loss blends the original 3D-GS objective $\mathcal{L}_o$ with a bootstrapping term $\mathcal{L}_b$ and an upscaling term $\mathcal{L}_u$: $\mathcal{L}_h = (1-\lambda_{\text{boot}}-\lambda_{\text{up}})\mathcal{L}_o + \mathcal{L}_{b} + \mathcal{L}_{u}$, where $\mathcal{L}_{b} = \frac{\lambda_{\text{boot}}}{N} \sum_{i=1}^{N} \mathcal{L}^i_b$ averages the $N$ individual bootstrapping losses $\mathcal{L}^i_b$ to reinforce multi-view consistency. Empirically, the approach yields state-of-the-art results across diverse datasets (Mip-NeRF360, Tanks & Temples, DeepBlending, BungeeNeRF), trains faster than prior bootstrapping pipelines, and generalizes better to out-of-distribution poses, while mitigating zoom-in artifacts and enriching fine details in zoomed views.
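To make the weighting in $\mathcal{L}_h$ concrete, the following is a minimal sketch of how the three terms might be combined, written with plain Python floats for clarity. The function names (`bootstrap_loss`, `hybrid_loss`) and the range check on the weights are illustrative assumptions, not the authors' implementation. Since the summary does not define $\mathcal{L}_u$ further, the sketch also assumes the upscaling term is passed in already weighted, mirroring how $\lambda_{\text{boot}}$ is folded into $\mathcal{L}_b$.

```python
from typing import Sequence


def bootstrap_loss(per_sample_losses: Sequence[float], lambda_boot: float) -> float:
    """L_b = (lambda_boot / N) * sum_i L_b^i over N bootstrapping losses."""
    n = len(per_sample_losses)
    if n == 0:
        # Assumption: with no bootstrapped samples the term contributes nothing.
        return 0.0
    return lambda_boot * sum(per_sample_losses) / n


def hybrid_loss(
    l_o: float,
    per_sample_boot_losses: Sequence[float],
    l_u: float,
    lambda_boot: float,
    lambda_up: float,
) -> float:
    """L_h = (1 - lambda_boot - lambda_up) * L_o + L_b + L_u.

    l_u is assumed to be the already-weighted upscaling term.
    """
    # Assumption: the weights are non-negative and leave a non-negative
    # coefficient for the original 3D-GS objective.
    if lambda_boot < 0 or lambda_up < 0 or lambda_boot + lambda_up > 1:
        raise ValueError("expected lambda_boot, lambda_up >= 0 with sum <= 1")
    l_b = bootstrap_loss(per_sample_boot_losses, lambda_boot)
    return (1.0 - lambda_boot - lambda_up) * l_o + l_b + l_u


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the formula.
    print(hybrid_loss(1.0, [0.5, 1.5], 0.2, lambda_boot=0.25, lambda_up=0.25))
```

Note that the coefficient on $\mathcal{L}_o$ shrinks as the two pseudo-ground-truth weights grow, so the original objective keeps the remaining share of a unit weight budget.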
Abstract
The emergence of 3D Gaussian Splatting (3D-GS) has significantly advanced 3D reconstruction by providing high fidelity and fast training across various scenarios. While recent efforts have mainly focused on improving model structures to compress data volume or reduce artifacts during zoom-in and zoom-out operations, they often overlook an underlying issue: training sampling deficiency. In zoomed-in views, Gaussian primitives can appear unregulated and distorted because of their dilation limitations and the lack of scale-specific training samples. Consequently, incorporating pseudo-details that ensure the completeness and alignment of the scene becomes essential. In this paper, we introduce a new training method that integrates diffusion models and multi-scale training using pseudo-ground-truth data. This approach not only substantially mitigates dilation and zoom-in artifacts but also enriches reconstructed scenes with precise details beyond those present in the existing training views. Our method achieves state-of-the-art performance across various benchmarks and extends the capabilities of 3D reconstruction beyond training datasets.
