FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain

YuAn Wang, Xiaofan Li, Chi Huang, Wenhao Zhang, Hao Li, Bosheng Wang, Xun Sun, Jun Wang

TL;DR

FaithFusion addresses the problem of reconciling geometric fidelity in 3DGS-based driving-scene reconstruction with plausible appearance generation under large viewpoint shifts. It introduces pixel-wise Expected Information Gain (EIG) as a unified policy that serves both as a spatial prior guiding diffusion and as a loss weight for distilling the edits back into 3DGS. Using the Laplace approximation, the method derives the tractable upper bound $\text{EIG} \le \frac{1}{2} \operatorname{tr}\left(H''[Y_{NVS}|X_{NVS},\boldsymbol{\omega}^*](H''[\boldsymbol{\omega}^*])^{-1}\right)$ and distributes the information across pixels along rays. The approach comprises a dual-branch EIGent generator and a progressive diffusion-to-3DGS integration that operates without extra priors. Experiments on Waymo show FaithFusion achieving state-of-the-art results across NTA-IoU, NTL-IoU, and FID, and remaining robust to lane shifts of up to $6$ meters, where it reaches an FID of $107.47$. The work offers a general, plug-and-play framework for unified, controllable 4D driving-scene modeling with potential for active mapping extensions.
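To make the trace bound concrete, the sketch below evaluates it under a diagonal Hessian approximation, which is an assumption made here for illustration and is not stated in this summary. The names `eig_upper_bound`, `pixel_eig_map`, and `pixel_curvature` are hypothetical, as is the idea that each pixel's ray contributes curvature only to the parameters it touches. Because the trace is linear, the per-pixel values sum to the scene-level bound.

```python
def eig_upper_bound(h_novel, h_train, eps=1e-8):
    """0.5 * tr(H_novel @ inv(H_train)) when both Hessians are diagonal.

    h_novel: diagonal of the novel-view Hessian H''[Y_NVS | X_NVS, w*]
    h_train: diagonal of the Hessian H''[w*] at the fitted parameters
    """
    if len(h_novel) != len(h_train):
        raise ValueError("Hessian diagonals must have the same length")
    return 0.5 * sum(n / (t + eps) for n, t in zip(h_novel, h_train))


def pixel_eig_map(pixel_curvature, h_train, eps=1e-8):
    """Split the bound across pixels.

    pixel_curvature maps a pixel to {parameter index: curvature contributed
    by that pixel's ray}; this layout is an assumption for the sketch.
    """
    return {
        pixel: 0.5 * sum(c / (h_train[i] + eps) for i, c in contribs.items())
        for pixel, contribs in pixel_curvature.items()
    }


# Toy example: three parameters, two pixels.
h_train = [2.0, 4.0, 1.0]
pixel_curvature = {(0, 0): {0: 1.0, 1: 2.0}, (0, 1): {2: 3.0}}
h_novel = [1.0, 2.0, 3.0]  # sum of the per-pixel contributions

eig_map = pixel_eig_map(pixel_curvature, h_train)
total = eig_upper_bound(h_novel, h_train)
```

In this toy case the pixel whose ray touches the weakly constrained parameter (small training curvature) receives the larger EIG, which matches the intuition that poorly observed regions are the ones worth repairing.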

Abstract

In controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts is crucial. However, effective fusion of geometry-based 3DGS and appearance-driven diffusion models faces inherent challenges, as the absence of pixel-wise, 3D-consistent editing criteria often leads to over-restoration and geometric drift. To address these issues, we introduce FaithFusion, a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG). EIG acts as a unified policy for coherent spatio-temporal synthesis: it guides diffusion as a spatial prior to refine high-uncertainty regions, while its pixel-level weighting distills the edits back into 3DGS. The resulting plug-and-play system is free from extra prior conditions and structural modifications. Extensive experiments on the Waymo dataset demonstrate that our approach attains SOTA performance across NTA-IoU, NTL-IoU, and FID, maintaining an FID of 107.47 even at a 6-meter lane shift. Our code is available at https://github.com/wangyuanbiubiubiu/FaithFusion.
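One way to picture the pixel-level weighting is an EIG-weighted photometric loss between the 3DGS render and the diffusion-restored view. The sketch below is only an illustration: the function name `eig_weighted_l1`, the max-normalization of the weights, and the choice of a weighted mean L1 are all assumptions, since this summary does not give the exact loss.

```python
def eig_weighted_l1(rendered, restored, eig, eps=1e-8):
    """Weighted mean absolute error with weights proportional to per-pixel EIG.

    rendered, restored, eig: flat lists of equal length, one value per pixel.
    Normalizing by the peak EIG is an assumption made for this sketch.
    """
    if not (len(rendered) == len(restored) == len(eig)):
        raise ValueError("inputs must have the same number of pixels")
    peak = max(eig)
    weights = [e / (peak + eps) for e in eig]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0  # no pixel carries information gain, so nothing to distill
    weighted_error = sum(
        w * abs(r - t) for w, r, t in zip(weights, rendered, restored)
    )
    return weighted_error / total_weight


# The low-EIG pixel (already well reconstructed) is ignored; the high-EIG
# pixel drives the update toward the restored view.
loss = eig_weighted_l1(rendered=[0.0, 0.0], restored=[1.0, 0.5], eig=[0.0, 2.0])
```

Under this weighting, regions the reconstruction already explains well contribute little to the fine-tuning signal, which is one plausible way to limit over-restoration.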

Paper Structure

This paper contains 18 sections, 16 equations, 10 figures, 2 tables, 1 algorithm.

Figures (10)

  • Figure 1: Comparative overview. Comparison of FreeVS [wang2024freevs], OmniRe [chen2024omnire], the fusion-based methods DIFIX3D+ [wu2025difix3d+] and ReconDreamer++ [zhao2025recondreamer++], and our EIG-integrated FaithFusion, which simultaneously achieves consistency, quality, and faithfulness.
  • Figure 2: FaithFusion pipeline. The EIG-guided progressive training loop has three steps. Step 1: Novel-view synthesis. Render laterally offset novel views and their pixel-level EIG maps from the original 3DGS. Step 2: EIGent Fixed. Feed the renders and EIG maps into EIGent to repair high-EIG regions, using Video DiT early for spatio-temporal consistency and DIFIX3D+ later for per-frame perceptual refinement. Step 3: EIG-guided 3DGS Update. Fine-tune the 3DGS model with the EIGent-restored views and EIG maps.
  • Figure 3: Image quality vs. EIG mask threshold. We validate pixel-level EIG as a proxy for novel-view synthesis quality by progressively retaining high-EIG regions and evaluating PSNR. The consistent decrease in PSNR as high-EIG regions are retained confirms that higher EIG marks lower-quality rendering.
  • Figure 4: Overview of EIGent. Data: Cross-view pairing: a forward-camera–trained 3DGS renders right-front views to produce artifact-prone novel-view renders and per-pixel EIG (Alg. 1), temporally aligned with real right-front videos. Architecture: EIGent is a dual-branch model with coarse-to-fine EIG guidance: downsampled $E$, noise latent $L_N$, and VAE latent $L$ feed a lightweight context encoder $\mathcal{G}$; a mask $M$ suppresses high-EIG regions. Via cross-attention with a DIFIX branch, cues are injected into a pretrained DiT backbone, enabling EIG-aware controllable repair and foreground spatio-temporal consistency.
  • Figure 5: Qualitative comparison on Waymo [sun2020scalability]. Novel-view renderings for the same trajectory across representative methods [wang2024freevs, ni2025recondreamer, zhao2025recondreamer++, wu2025difix3d+]. Orange boxes highlight regions where our approach yields noticeably better results.
  • ...and 5 more figures
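
The check described in the Figure 3 caption (retain only high-EIG pixels, then measure PSNR) can be sketched as follows. This is a minimal illustration under assumed conventions: images are flat lists of intensities in [0, 1], pixels with EIG at or above the threshold are retained, and the function name `masked_psnr` is hypothetical rather than taken from the paper's code.

```python
import math


def masked_psnr(rendered, reference, eig, threshold, max_val=1.0):
    """PSNR computed only over pixels whose EIG is at or above the threshold.

    Returns None when no pixel is retained and math.inf when the retained
    pixels match the reference exactly.
    """
    squared_errors = [
        (r - g) ** 2
        for r, g, e in zip(rendered, reference, eig)
        if e >= threshold
    ]
    if not squared_errors:
        return None
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


# Toy data in which the rendering error grows with EIG.
rendered = [0.5, 0.5, 0.5, 0.5]
reference = [0.5, 0.6, 0.7, 0.9]
eig = [0.1, 0.2, 0.5, 0.9]

curve = [masked_psnr(rendered, reference, eig, t) for t in (0.0, 0.5, 0.9)]
```

On this toy data the PSNR falls as the threshold rises, which is the trend the caption reports; real EIG maps and renders would replace the hand-written lists.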