LEDiff: Latent Exposure Diffusion for HDR Generation

Chao Wang, Zhihao Xia, Thomas Leimkuehler, Karol Myszkowski, Xuaner Zhang

TL;DR

LEDiff tackles the limitation of 8-bit LDR content by enabling HDR generation within pre-trained latent diffusion models through latent space exposure fusion. It preserves the original latent space and uses a small HDR dataset to train a learnable fusion module and a fine-tuned HDR decoder, complemented by highlight and shadow denoisers to hallucinate clipped regions. The approach supports both HDR content generation (text-to-HDR, panorama, and video) and LDR-to-HDR reconstruction, showing competitive HDR quality across objective metrics and user studies. This work provides a practical, plug-and-play pathway to photorealistic HDR outputs for generative content and HDR-enabled downstream tasks like image-based lighting and depth-of-field rendering. It also demonstrates robust performance with limited HDR data, suggesting strong potential for broad adoption and further improvements in HDR-aware generative modeling.

Abstract

While consumer displays increasingly support more than 10 stops of dynamic range, most image assets, such as internet photographs and generative AI content, remain limited to 8-bit low dynamic range (LDR), constraining their utility across high dynamic range (HDR) applications. Currently, no generative model can produce high-bit, high-dynamic-range content in a generalizable way. Existing LDR-to-HDR conversion methods often struggle to produce photorealistic details and physically plausible dynamic range in clipped areas. We introduce LEDiff, a method that equips a generative model with HDR content generation through latent space fusion inspired by image-space exposure fusion techniques. It also functions as an LDR-to-HDR converter, expanding the dynamic range of existing LDR images. Our approach uses a small HDR dataset to enable a pretrained diffusion model to recover detail and dynamic range in clipped highlights and shadows. LEDiff brings HDR capabilities to existing generative models and converts any LDR image to HDR, creating photorealistic HDR outputs for image generation, image-based lighting (HDR environment map generation), and photographic effects such as depth-of-field simulation, where linear HDR data is essential for realistic quality.
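To make the clipping problem concrete, the following is a minimal sketch (not the paper's code) of how a single 8-bit exposure loses information from a scene with a wide dynamic range. The scene values, the gamma of 2.2, and the helper names `expose_and_quantize` and `dynamic_range_stops` are illustrative assumptions.

```python
import math

def expose_and_quantize(radiance, ev=0.0, gamma=2.2):
    """Simulate an 8-bit LDR capture of one linear radiance value.

    ev shifts the exposure in stops; gamma is an assumed display encoding.
    """
    scaled = radiance * (2.0 ** ev)
    clipped = min(max(scaled, 0.0), 1.0)  # values above 1.0 saturate
    return round(255 * clipped ** (1.0 / gamma))

def dynamic_range_stops(values):
    """Dynamic range in stops: log2(max / min) over the positive values."""
    positive = [v for v in values if v > 0]
    return math.log2(max(positive) / min(positive))

# Hypothetical linear scene radiances, from deep shadow to a light source.
scene = [0.001, 0.01, 0.18, 1.0, 8.0, 64.0]

print("scene range in stops:", dynamic_range_stops(scene))
for ev in (-6, 0, 2):
    print("EV", ev, [expose_and_quantize(r, ev) for r in scene])
```

At EV 0 the two brightest values both map to 255 and can no longer be told apart. Lowering the exposure separates them again, but compresses the shadows into a handful of code values, which is why several exposures (or a true HDR representation) are needed.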

Paper Structure

This paper contains 14 sections, 3 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: LEDiff enables high dynamic range (HDR) content generation with photorealistic details in both over- and under-exposed regions by performing exposure fusion in latent space, making it applicable to generated content and real photos mapped to the latent space. While existing generative models (e.g., Stable Diffusion) are restricted to low dynamic range (a) and standard cameras struggle to capture full scene dynamic range, causing clipping in highlights (b) and shadows (c), LEDiff restores both detail and dynamic range (d)--(f), as shown in scanline plots. All HDR images are tone-mapped for visualization and are best viewed on an HDR display. See the supplemental for more details.
  • Figure 2: Fine-tuning scheme of decoder and denoiser. (Left: fine-tuning the decoder) Exposure-bracketed images $I_{+}, I_{0}, I_{-}$ are encoded via the pre-trained encoder to generate corresponding latent codes. These latent codes are fused using a learnable fusion module $\mathcal{F}$ to produce a latent code $\mathcal{C}_{\text{merge}}$ free of clipping, which is then decoded into an HDR image $\mathcal{H}$ through the fine-tuned decoder. (Right: fine-tuning the denoiser) The model takes as input the latent code $\mathcal{C}_{+}$ for training the highlight denoiser $\epsilon_{\theta_{-}}$ or $\mathcal{C}_{-}$ for training the shadow denoiser $\epsilon_{\theta_{+}}$, along with a $\mathcal{C}_{0}$ corrupted by randomly sampled noise.
  • Figure 3: Limitations of vanilla Stable Diffusion (SD) in generating HDR content. Left: The limitation of the SD VAE in encoding and decoding an HDR image, visualized at multiple exposure levels, revealing a significant fidelity loss, especially in the shadows. Right: An image generated with SD, along with a scanline that shows pixel clipping in highlight and shadow regions.
  • Figure 4: Left: Text-to-HDR image generation using the prompts (1) "Bright car headlights on a narrow street at night" and (2) "A grand church interior with tall stained glass windows, intricate wooden arches, and warm lighting". We compare the images generated by SD and LEDiff. We plot a scanline passing through the car headlights and the bright luminaires to demonstrate our ability to produce a wide dynamic range with photorealistic detail. We also show an application of synthetic depth of field (DoF), where a linear HDR image is crucial for rendering realistic defocus effects. Right: Comparison of the panoramas produced by MVDiffusion [Tang2023mvdiffusion] and our approach for the prompt "A peaceful beach at sunset with soft clouds in the sky", along with their respective image-based lighting results. Our method leads to higher contrast and an overall more realistic appearance.
  • Figure 5: LDR-to-HDR image reconstruction comparisons. Our method effectively hallucinates details in both over- and under-exposed regions, while previous approaches [eilertsen2017hdr, santos2020single, liu2020single, marnerides2018expandnet, wang2023glowgan] struggle to produce plausible results, especially in shadow regions that they do not address (e.g., HDRCNN and MaskHDR yield identical results for shadow hallucination, as both methods process non-clipped regions in the same way). Images are tone-mapped for visualization. Best viewed in HDR on an HDR display; see the supplemental for further details.
  • ...and 1 more figures
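
The fusion step described in the Figure 2 caption is inspired by image-space exposure fusion. As a rough illustration only, the sketch below fuses three aligned exposures with a fixed, hand-written "well-exposedness" weight. This is a stand-in for the learnable fusion module $\mathcal{F}$, which in the paper is trained and operates on latent codes rather than pixel values. The function names, the Gaussian weight parameters, and the sample scanlines are all assumptions made for illustration.

```python
import math

def well_exposedness(value, mid=0.5, sigma=0.2):
    """Gaussian weight that favours mid-range values (image-space heuristic)."""
    return math.exp(-((value - mid) ** 2) / (2 * sigma ** 2))

def fuse_brackets(under, normal, over, eps=1e-8):
    """Per-element weighted average of three aligned exposures.

    Fixed weights stand in for the learnable fusion module of Figure 2;
    the actual module is trained and works on latent codes, not pixels.
    """
    fused = []
    for values in zip(under, normal, over):
        weights = [well_exposedness(v) + eps for v in values]
        total = sum(weights)
        fused.append(sum(w * v for w, v in zip(weights, values)) / total)
    return fused

# Hypothetical 1-D scanlines in [0, 1]; the normal exposure clips at both ends.
under  = [0.00, 0.02, 0.20, 0.45, 0.60]
normal = [0.00, 0.10, 0.50, 1.00, 1.00]
over   = [0.15, 0.40, 0.90, 1.00, 1.00]

fused = fuse_brackets(under, normal, over)
print([round(v, 3) for v in fused])
```

In this toy example the last two samples are both saturated in the normal exposure, yet the fused result keeps them distinct because the under-exposed bracket receives most of the weight there. The fused values still lie in [0, 1]; in the pipeline described by the caption, mapping the merged code to a linear HDR image is the job of the fine-tuned decoder.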