Beyond Existance: Fulfill 3D Reconstructed Scenes with Pseudo Details

Yifei Gao, Jun Huang, Lei Wang, Ruiting Dai, Jun Cheng

TL;DR

This work tackles zoom-in artifacts in 3D Gaussian Splatting (3D-GS) arising from training sampling gaps. It introduces a hybrid training pipeline that couples Bootstrap-GS with upscale diffusion models (Latent Diffusion Models) to generate high-frequency, scene-consistent pseudo-ground-truth while employing a frequency-aware interpolation and filtering strategy. The final loss blends original 3D-GS objectives with bootstrapping and upscaling terms, $\mathcal{L}_h = (1-\lambda_{\text{boot}}-\lambda_{\text{up}})\mathcal{L}_o + \mathcal{L}_{b} + \mathcal{L}_{u}$, and uses $\mathcal{L}_{b} = \frac{\lambda_{\text{boot}}}{N} \sum_{i \in N} \mathcal{L}^i_b$ to reinforce multi-view consistency. Empirically, the approach yields state-of-the-art results across diverse datasets (Mip-NeRF360, Tanks & Temples, DeepBlending, BungeeNeRF) with faster training than prior bootstrapping pipelines and improved generalization to out-of-distribution poses, while mitigating zoom-in artifacts and enriching fine details in zoomed views.
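The blended objective above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function name `hybrid_loss` and its scalar inputs are hypothetical, the per-view bootstrapping losses $\mathcal{L}^i_b$ are assumed to be already computed, and $\mathcal{L}_u$ is passed in as given because the summary does not define how it is formed.

```python
from typing import Sequence


def hybrid_loss(
    loss_o: float,
    boot_losses: Sequence[float],
    loss_u: float,
    lambda_boot: float,
    lambda_up: float,
) -> float:
    """Sketch of L_h = (1 - lambda_boot - lambda_up) * L_o + L_b + L_u.

    All names here are hypothetical. `loss_u` is taken as an already
    computed upscaling term, since the summary does not define it.
    """
    n = len(boot_losses)
    if n == 0:
        raise ValueError("need at least one bootstrapping loss")
    # L_b = (lambda_boot / N) * sum_i L_b^i, as stated in the summary.
    loss_b = lambda_boot / n * sum(boot_losses)
    # The original 3D-GS objective is down-weighted by both lambdas.
    return (1.0 - lambda_boot - lambda_up) * loss_o + loss_b + loss_u
```

With both weights set to zero and `loss_u = 0`, the function reduces to the original objective, which is a quick sanity check on the weighting.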

Abstract

The emergence of 3D Gaussian Splatting (3D-GS) has significantly advanced 3D reconstruction by providing high fidelity and fast training speeds across various scenarios. While recent efforts have mainly focused on improving model structures to compress data volume or reduce artifacts during zoom-in and zoom-out operations, they often overlook an underlying issue: training sampling deficiency. In zoomed-in views, Gaussian primitives can appear unregulated and distorted due to their dilation limitations and the insufficient availability of scale-specific training samples. Consequently, incorporating pseudo-details that ensure the completeness and alignment of the scene becomes essential. In this paper, we introduce a new training method that integrates diffusion models and multi-scale training using pseudo-ground-truth data. This approach not only notably mitigates the dilation and zoomed-in artifacts but also enriches reconstructed scenes with precise details out of existing scenarios. Our method achieves state-of-the-art performance across various benchmarks and extends the capabilities of 3D reconstruction beyond training datasets.

Paper Structure

This paper contains 34 sections, 1 theorem, 12 equations, 8 figures, 14 tables.

Key Result

Theorem 1

Mathematically, let $X_1, X_2, \dots, X_n$ be i.i.d. random variables with expected value $\mu = \mathbb{E}[X_i]$ and variance $\sigma^2 = \operatorname{Var}(X_i) < \infty$, then, for any $\varepsilon > 0$:
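The conclusion of the theorem is cut off at this point in the source. For reference only, the hypotheses match those of the weak law of large numbers; its standard form, with the Chebyshev bound that the finite-variance assumption permits, is given below. Whether the paper states the limit, the bound, or another variant cannot be recovered from the text here, and the symbol $\bar{X}_n$ is introduced for this note.

$$
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
P\left(\left|\bar{X}_n - \mu\right| \ge \varepsilon\right) \le \frac{\sigma^2}{n\varepsilon^2}
\;\xrightarrow[n \to \infty]{}\; 0 .
$$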

Figures (8)

  • Figure 1: Rendering comparisons. While most methods render images faithfully at standard scales, their zoomed-in views exhibit significant artifacts that are not apparent at normal magnifications.
  • Figure 2: Signal visualization of the rendering process of Gaussian primitives. (a) presents the zoomed-in view, while (b) displays the normal view. In each panel, the Gaussians are shown on the left and their corresponding signals on the right. The original signal is represented by the blue curve, and the sampled signal during rendering is indicated by the red curve. After discrete sampling and filtering, the high-frequency details (represented by the red and brown Gaussians) are observable only in the zoomed-in view; in the normal view they are filtered out and have negligible effect, as shown at the bottom right of (b).
  • Figure 3: Visualization of our pipeline. (a) Starting from an original camera, we first narrow its field of view to construct corresponding zoomed-in cameras and bootstrapping cameras. We render images from these cameras (at a reduced scale for upscaling). These new renderings are then fed into diffusion models for regeneration. (b) The regenerated images are used as pseudo ground truth to contribute to our hybrid loss function.
  • Figure 4: Main comparisons. (a) For extremely small details present in the ground truth images, our method effectively completes and reconstructs these details. (b) In scenarios involving partially occluded views within the training datasets, our random upscale sampling technique enables the generation of fine-grained details that align with the ground truth. (c) For vague or indistinct details even in the ground truth, our upscaling diffusion models are capable of denoising the ground truth and generating high-frequency details. Note: All our renderings are produced from zoomed-in views, for which there is no directly aligned ground truth available.
  • Figure 5: Object rendering comparisons. Our method enables the generation of fine-grained, flexible, and scene-aligned details using upscaling diffusion models on zoomed-in scales without impacting the integrity of the original rendering on normal scales.
  • ...and 3 more figures
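
The Figure 3 caption mentions narrowing a camera's field of view to construct zoomed-in cameras but does not say how. A minimal sketch under the standard pinhole-camera relation $f = W / (2\tan(\text{fov}/2))$ follows; the function names and the `zoom` parameter are hypothetical and are not taken from the paper.

```python
import math


def focal_length_px(width_px: float, fov_rad: float) -> float:
    """Pinhole focal length in pixels for a given horizontal field of view."""
    return width_px / (2.0 * math.tan(fov_rad / 2.0))


def zoomed_fov(fov_rad: float, zoom: float) -> float:
    """Field of view after magnifying by `zoom` (hypothetical parameter).

    Zooming by a factor z multiplies the focal length by z, so
    tan(fov' / 2) = tan(fov / 2) / z.
    """
    if zoom <= 0:
        raise ValueError("zoom must be positive")
    return 2.0 * math.atan(math.tan(fov_rad / 2.0) / zoom)


# Example: a 90-degree camera zoomed in 2x keeps the same image width
# while its focal length doubles.
base_fov = math.pi / 2
narrow_fov = zoomed_fov(base_fov, 2.0)
ratio = focal_length_px(800, narrow_fov) / focal_length_px(800, base_fov)
```

Keeping the image resolution fixed while shrinking the field of view is what makes the render a zoomed-in view rather than a crop of the original.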

Theorems & Definitions (1)

  • Theorem 1