Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models

Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, Huan Ling

TL;DR

Difix3D+ introduces a unified pipeline that leverages a single-step diffusion model, Difix, both to improve 3D reconstructions from NeRF and 3D Gaussian Splatting and to provide real-time post-render enhancement. By distilling improved novel views back into the 3D representation through a progressive 3D update process and applying a fast post-render refinement, the approach achieves strong multi-view consistency and perceptual quality while remaining computationally efficient. The method demonstrates notable improvements in FID and PSNR across in-the-wild and automotive driving datasets, and is compatible with both implicit and explicit 3D representations, enabling practical deployment. Overall, Difix3D+ offers a fast, diffusion-prior-based solution to persistent artifacts in 3D novel-view synthesis, with potential for real-time applications and scalability to large scenes.

Abstract

Neural Radiance Fields and 3D Gaussian Splatting have revolutionized 3D reconstruction and novel-view synthesis tasks. However, achieving photorealistic rendering from extreme novel viewpoints remains challenging, as artifacts persist across representations. In this work, we introduce Difix3D+, a novel pipeline designed to enhance 3D reconstruction and novel-view synthesis through single-step diffusion models. At the core of our approach is Difix, a single-step image diffusion model trained to enhance and remove artifacts in rendered novel views caused by underconstrained regions of the 3D representation. Difix serves two critical roles in our pipeline. First, it is used during the reconstruction phase to clean up pseudo-training views that are rendered from the reconstruction and then distilled back into 3D. This greatly enhances underconstrained regions and improves the overall 3D representation quality. More importantly, Difix also acts as a neural enhancer during inference, effectively removing residual artifacts arising from imperfect 3D supervision and the limited capacity of current reconstruction models. Difix3D+ is a general solution, a single model compatible with both NeRF and 3DGS representations, and it achieves an average 2$\times$ improvement in FID score over baselines while maintaining 3D consistency.

Paper Structure

This paper contains 44 sections, 10 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: We demonstrate Difix3D+ on both in-the-wild scenes (top) and driving scenes (bottom). Recent novel-view synthesis methods struggle in sparse-input settings or when rendering views far from the input camera poses. Difix distills the priors of 2D generative models to enhance reconstruction quality and can further act as a neural renderer at inference time to mitigate the remaining inconsistencies. Notably, the same model effectively corrects NeRF [mildenhall2020nerf] and 3DGS [kerbl3Dgaussians] artifacts.
  • Figure 2: Difix3D+ pipeline. The overall pipeline of the Difix3D+ model involves the following stages. Step 1: Given a pretrained 3D representation, we render novel views and feed them to Difix, which acts as a neural enhancer, removing the artifacts and improving the quality of the noisy rendered views (\ref{sec:single-step-diffusion}). The camera poses selected to render the novel views are obtained through pose interpolation, gradually approaching the target poses from the reference ones. Step 2: The cleaned novel views are distilled back into the 3D representation to improve its quality (\ref{sec:3d_consistency}). Steps 1 and 2 are applied over several iterations to progressively grow the spatial extent of the reconstruction and hence ensure strong conditioning of the diffusion model (Difix3D). Step 3: Difix additionally acts as a real-time neural enhancer, further improving the quality of the rendered novel views.
  • Figure 3: Difix architecture. Difix takes a noisy rendered image and reference views as input (left), and outputs an enhanced version of the input image with reduced artifacts (right). Difix also generates identical reference views, which we discard in practice and hence depict as transparent. The model architecture consists of a U-Net structure with a cross-view reference mixing layer (\ref{sec:single-step-diffusion}) to maintain consistency across reference views. Difix is fine-tuned from SD-Turbo, using a frozen VAE encoder and a LoRA fine-tuned decoder.
  • Figure 4: Noise level. To validate our hypothesis that the distribution of images with NeRF/3DGS artifacts is similar to the distribution of noisy images used to train SD-Turbo [sauer2025adversarial], we perform single-step "denoising" at varying noise levels. At higher noise levels (e.g., $\tau = 600$), the model effectively removes artifacts but also alters the image context. At lower noise levels (e.g., $\tau = 10$), the model makes only minor adjustments, leaving most artifacts intact. $\tau = 200$ strikes a good balance, removing artifacts while preserving context, and achieves the highest metrics.
  • Figure 5: In-the-wild artifact removal. We show comparisons on held-out scenes from the DL3DV dataset [ling2024dl3dv] (top, above the dashed line) and the Nerfbusters dataset [warburg2023nerfbusters] (bottom). Difix3D+ corrects significantly more artifacts than other methods.
  • ...and 5 more figures
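
The progressive update loop described in the Figure 2 caption (interpolate a pose toward the target, render, enhance, distill, repeat) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `interpolate_pose` and `progressive_update`, the `render`, `enhance`, and `distill` callbacks, and the evenly spaced schedule are all hypothetical, and a pose is simplified here to a camera position (a full pose would also need rotation interpolation).

```python
from typing import Any, Callable, List

# Assumption: a "pose" is reduced to a camera position [x, y, z].
# Real camera poses also carry a rotation, which is omitted in this sketch.
Pose = List[float]


def interpolate_pose(reference: Pose, target: Pose, alpha: float) -> Pose:
    """Linearly blend two positions; alpha=0 is the reference, alpha=1 the target."""
    return [(1.0 - alpha) * r + alpha * t for r, t in zip(reference, target)]


def progressive_update(
    reference: Pose,
    target: Pose,
    num_iterations: int,
    render: Callable[[Pose], Any],          # hypothetical: renders the current 3D model
    enhance: Callable[[Any], Any],          # hypothetical: stands in for the Difix enhancer
    distill: Callable[[Pose, Any], None],   # hypothetical: updates the 3D model with a cleaned view
) -> List[Pose]:
    """Walk from the reference pose toward the target, cleaning and distilling each view."""
    visited: List[Pose] = []
    for step in range(1, num_iterations + 1):
        # Assumed schedule: evenly spaced steps that end exactly at the target pose.
        alpha = step / num_iterations
        pose = interpolate_pose(reference, target, alpha)
        noisy_view = render(pose)        # Step 1: render a novel view ...
        clean_view = enhance(noisy_view)  # ... and remove its artifacts
        distill(pose, clean_view)        # Step 2: distill the cleaned view back into 3D
        visited.append(pose)
    return visited


if __name__ == "__main__":
    # Stub callbacks so the sketch runs without any 3D or diffusion library.
    distilled = []
    poses = progressive_update(
        reference=[0.0, 0.0, 0.0],
        target=[1.0, 2.0, 4.0],
        num_iterations=4,
        render=lambda pose: f"render@{pose}",
        enhance=lambda view: f"clean({view})",
        distill=lambda pose, view: distilled.append((pose, view)),
    )
    print(poses[-1], len(distilled))
```

Because each iteration renders from a model that has already absorbed the previous cleaned views, the enhancer always sees inputs close to regions that are already well constrained, which is the conditioning argument the caption makes.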