LDM-ISP: Enhancing Neural ISP for Low Light with Latent Diffusion Models

Qiang Wen, Zhefan Rao, Yazhou Xing, Qifeng Chen

TL;DR

The paper addresses extreme low-light image enhancement (LLIE) by casting it as a generative restoration task guided by a pre-trained latent diffusion model. It introduces LDM-ISP, which tames a frozen Stable Diffusion backbone with lightweight taming modules and uses 2D discrete wavelet transforms (DWT) to split the task into low-frequency structure generation and high-frequency detail maintenance. By assigning the low-frequency (LL) sub-band to the UNet and the high-frequency (HF) sub-bands to the decoder, the approach exploits the distinct generative priors of the two components and achieves state-of-the-art perceptual performance on real LLIE datasets. The method demonstrates practical benefits for neural ISP under challenging lighting, reducing the need for extensive dataset collection and full diffusion fine-tuning while delivering high-fidelity, artifact-free sRGB outputs.
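The frequency split described above can be illustrated with a minimal sketch. It assumes a Haar wavelet, a single-channel input with even height and width, and one common sub-band naming convention; the wavelet the paper actually uses is not stated here, and the function names `haar_dwt2`, `haar_idwt2`, and `dwt_pyramid` are hypothetical.

```python
import numpy as np

def haar_dwt2(x):
    """One level of an orthonormal 2D Haar DWT on an (H, W) array (H, W even).

    Returns (LL, LH, HL, HH), each of shape (H/2, W/2). Sub-band naming
    follows one common convention and is an assumption of this sketch.
    """
    a = x[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # local average: coarse structure
    lh = (a + b - c - d) / 2.0  # difference between rows
    hl = (a - b + c - d) / 2.0  # difference between columns
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the (2h, 2w) array from four sub-bands."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def dwt_pyramid(x, levels):
    """Apply the DWT repeatedly to the LL sub-band.

    Returns the final LL and a list of (LH, HL, HH) tuples, finest level first.
    """
    highs = []
    ll = x
    for _ in range(levels):
        ll, lh, hl, hh = haar_dwt2(ll)
        highs.append((lh, hl, hh))
    return ll, highs

# Toy input standing in for one channel of a processed RAW image.
image = np.arange(64, dtype=float).reshape(8, 8)
ll, highs = dwt_pyramid(image, levels=2)
```

In this sketch the LL sub-band is a downsampled local average, which is why it carries structure, while LH, HL, and HH hold the differences needed to restore fine detail. Because the transform is invertible, no information is lost by routing the two groups of sub-bands to different parts of the model.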

Abstract

Enhancing a low-light noisy RAW image into a well-exposed and clean sRGB image is a significant challenge for modern digital cameras. Prior approaches have difficulties in recovering fine-grained details and true colors of the scene under extremely low-light environments due to near-zero SNR. Meanwhile, diffusion models have shown significant progress towards general domain image generation. In this paper, we propose to leverage the pre-trained latent diffusion model to perform the neural ISP for enhancing extremely low-light images. Specifically, to tailor the pre-trained latent diffusion model to operate on the RAW domain, we train a set of lightweight taming modules to inject the RAW information into the diffusion denoising process via modulating the intermediate features of UNet. We further observe different roles of UNet denoising and decoder reconstruction in the latent diffusion model, which inspires us to decompose the low-light image enhancement task into latent-space low-frequency content generation and decoding-phase high-frequency detail maintenance. Through extensive experiments on representative datasets, we demonstrate that our simple design not only achieves state-of-the-art performance in quantitative evaluations but also shows significant superiority in visual comparisons over strong baselines, which highlights the effectiveness of powerful generative priors for neural ISP under extremely low-light environments. The project page is available at https://csqiangwen.github.io/projects/ldm-isp/

Paper Structure

This paper contains 20 sections, 4 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Detection results obtained from YOLOv8 [yolov8] show significant improvements with the proposed method. The proposed method yields clearer low-light enhancements, which enable more accurate detection.
  • Figure 2: (a) Overview of the proposed LLIE method, LDM-ISP. The Bayer RAW image is processed with operations similar to those in [chen2018learning]. A series of 2D discrete wavelet transforms (DWT) is applied to the processed image to capture the low-frequency sub-band (LL) and the high-frequency sub-bands (LH, HL, HH). The low-frequency sub-band (LL) modulates the feature at each layer of the UNet. Specifically, each feature has its own taming module, whose key part is an SFT layer that maps the sub-band to a pair of scale $\gamma$ and shift $\beta$ parameters. Similarly, the features of the decoder $\mathcal{D}$ are modulated by another set of taming modules, in which the LL sub-band is mapped to the scale $\gamma$ and the concatenation of LH, HL, and HH is mapped to the shift $\beta$. All parameters of the pre-trained Stable Diffusion are frozen; only the taming modules are trainable. (b) The low-frequency sub-band, extracted using 2D discrete wavelet transforms (DWT), reveals clearer structural information than the low-light input.
  • Figure 3: A text-to-image example from Stable Diffusion. The difference map between the latent representation and the sRGB output indicates that the latent representation generated by the UNet carries the essential structural content, while the decoder $\mathcal{D}$ adds abundant details but scarcely modifies the global structure of the final sRGB output. The latent representation is converted to an sRGB visualization by the linear transformation of [Keturn2023].
  • Figure 4: Qualitative evaluations on the SID-Sony [chen2018learning] and ELD-Sony [wei2021physics] datasets (zoom in for best view). The proposed method shows notable superiority in recovering structural information and enhancing details in extremely dark and noisy regions. (The illumination of the low-light input is increased for visualization.)
  • Figure 5: The visualization of the ablation study. Different taming strategies (resize and DWT) and taming targets (UNet and decoder $\mathcal{D}$) are combined in this study. (The illumination of the low-light input is increased for visualization.)
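
The scale-and-shift modulation described in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the modulation is taken to be `gamma * feature + beta`, each mapping from sub-band to parameters is reduced to a single 1x1 convolution with random stand-in weights, the sub-bands are assumed to already match the spatial size of the feature, and all names and channel counts (`conv1x1`, `tame_unet_feature`, `tame_decoder_feature`, 3 sub-band channels, 8 feature channels) are hypothetical.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """Per-pixel linear map. x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def tame_unet_feature(feat, ll, gamma_head, beta_head):
    """UNet-side taming: both scale and shift are predicted from the LL sub-band."""
    gamma = conv1x1(ll, *gamma_head)
    beta = conv1x1(ll, *beta_head)
    return gamma * feat + beta  # assumed SFT form: scale, then shift

def tame_decoder_feature(feat, ll, lh, hl, hh, gamma_head, beta_head):
    """Decoder-side taming: scale from LL, shift from concatenated LH, HL, HH."""
    high = np.concatenate([lh, hl, hh], axis=0)  # stack along the channel axis
    gamma = conv1x1(ll, *gamma_head)
    beta = conv1x1(high, *beta_head)
    return gamma * feat + beta

def random_head(rng, c_in, c_out, bias_value):
    """Stand-in for learned weights (hypothetical initialization)."""
    return 0.01 * rng.standard_normal((c_out, c_in)), np.full(c_out, bias_value)

rng = np.random.default_rng(0)
c_sub, c_feat, height, width = 3, 8, 4, 4  # hypothetical sizes
feat = rng.standard_normal((c_feat, height, width))
ll, lh, hl, hh = (rng.standard_normal((c_sub, height, width)) for _ in range(4))

unet_out = tame_unet_feature(
    feat, ll,
    random_head(rng, c_sub, c_feat, 1.0),   # gamma starts near 1
    random_head(rng, c_sub, c_feat, 0.0),   # beta starts near 0
)
decoder_out = tame_decoder_feature(
    feat, ll, lh, hl, hh,
    random_head(rng, c_sub, c_feat, 1.0),
    random_head(rng, 3 * c_sub, c_feat, 0.0),
)
```

Only the heads that produce $\gamma$ and $\beta$ would be trained in such a setup; the feature `feat` comes from the frozen backbone, which matches the caption's statement that only the taming modules are trainable.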