DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer

Yuxuan Zhang, Katarína Tóthová, Zian Wang, Kangxue Yin, Haithem Turki, Riccardo de Lutio, Yen-Yu Chang, Or Litany, Sanja Fidler, Zan Gojcic

TL;DR

DiffusionHarmonizer is an online generative enhancement framework that transforms renderings of imperfect neural-reconstructed scenes into temporally consistent outputs while improving their realism. It is a scalable system that significantly elevates simulation fidelity in both research and production environments.

Abstract

Simulation is essential to the development and evaluation of autonomous robots such as self-driving vehicles. Neural reconstruction is emerging as a promising solution because it enables simulating a wide variety of scenarios from real-world data alone, in an automated and scalable way. However, while methods such as NeRF and 3D Gaussian Splatting can produce visually compelling results, they often exhibit artifacts, particularly when rendering novel views, and fail to realistically integrate inserted dynamic objects, especially objects captured from different scenes. To overcome these limitations, we introduce DiffusionHarmonizer, an online generative enhancement framework that transforms renderings from such imperfect scenes into temporally consistent outputs while improving their realism. At its core is a single-step, temporally conditioned enhancer, converted from a pretrained multi-step image diffusion model and capable of running in online simulators on a single GPU. The key to training it effectively is a custom data curation pipeline that constructs synthetic-real pairs emphasizing appearance harmonization, artifact correction, and lighting realism. The result is a scalable system that significantly elevates simulation fidelity in both research and production environments.
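
As a rough illustration of what "online" and "temporally conditioned" could mean in a simulator loop, the sketch below enhances frames one at a time with a single call per frame, passing a short buffer of earlier outputs as context. It is not the authors' implementation: the names run_online, toy_enhancer, and context_len are hypothetical, the choice of previous enhanced outputs as the conditioning signal is an assumption, and toy_enhancer is a trivial blend standing in for the learned single-step model.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Frame = List[float]  # stand-in for an image tensor (assumption for this sketch)


def run_online(
    rendered_frames: Sequence[Frame],
    enhance_step: Callable[[Frame, List[Frame]], Frame],
    context_len: int = 2,
) -> List[Frame]:
    """Enhance frames as they arrive, conditioning on recent outputs."""
    # Bounded buffer: only the last `context_len` enhanced frames are kept.
    context: Deque[Frame] = deque(maxlen=context_len)
    outputs: List[Frame] = []
    for frame in rendered_frames:
        # Exactly one enhancer call per frame (no iterative sampling loop).
        enhanced = enhance_step(frame, list(context))
        outputs.append(enhanced)
        context.append(enhanced)
    return outputs


def toy_enhancer(frame: Frame, context: List[Frame]) -> Frame:
    """Hypothetical placeholder for the learned model.

    Blends the current frame with the per-pixel mean of the context
    frames, which damps frame-to-frame changes.
    """
    if not context:
        return list(frame)
    mean_prev = [sum(px) / len(context) for px in zip(*context)]
    return [0.5 * cur + 0.5 * prev for cur, prev in zip(frame, mean_prev)]


frames = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
result = run_online(frames, toy_enhancer, context_len=1)
```

Because the loop only looks backward, it never needs future frames, which is the property an online simulator requires; a real enhancer would replace toy_enhancer with one forward pass of the network.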

Paper Structure (18 sections, 7 equations, 13 figures, 6 tables)

Figures (13)

  • Figure 1: DiffusionHarmonizer on Driving Scenes. Our method transforms artifact-prone neural-rendered frames into temporally coherent simulations, improving their realism by jointly correcting shadows, lighting, appearance discrepancies and reconstruction artifacts.
  • Figure 2: Overview of the data curation pipeline (top) and model architecture (bottom) of DiffusionHarmonizer. We use a single-step, temporally conditioned enhancement model that is converted from a pretrained multi-step image diffusion model. To train it effectively, we develop a data curation pipeline that constructs synthetic-real pairs emphasizing harmonization, artifact correction, and lighting realism.
  • Figure 3: Comparison with Image and Video Editing Baselines on Out-of-Domain Testing Data. Our method harmonizes color tone and synthesizes realistic lighting and shadows, while editing baselines often fail to produce physically plausible shadowing. Although both can reduce reconstruction artifacts, baselines tend to hallucinate inconsistent content and over-edit well-reconstructed regions, whereas our method preserves scene geometry and input structure. Moreover, image-editing baselines introduce frame-to-frame jitter, whereas our model maintains strong temporal coherence.
  • Figure 4: Comparison with Harmonization Baseline Methods. While both our method and the harmonization baselines adjust foreground appearance, the baselines fail to synthesize realistic shadows, resulting in less coherent composites.
  • Figure 5: Ablation on Loss Design. Removing perceptual supervision leads to oversmoothed outputs, while using a conventional LPIPS loss produces high-frequency artifacts. Our multi-scale formulation mitigates these artifacts and yields perceptually better results.
  • ...and 8 more figures