DarkDiff: Advancing Low-Light Raw Enhancement by Retasking Diffusion Models for Camera ISP

Amber Yijia Zheng, Yu Zhang, Jun Hu, Raymond A. Yeh, Chen Chen

TL;DR

DarkDiff addresses extreme low-light raw image enhancement by retasking a pre-trained diffusion model to operate within the camera ISP. It introduces an ISP-aware data pipeline, region-based cross-attention conditioning, a content-preserving residual VAE, and a decoder-space reconstruction loss that mitigates color shifts. Together these yield high perceptual quality, as measured by LPIPS, across SID, ELD, and LRD while maintaining competitive PSNR and SSIM. Quantitative and qualitative results show DarkDiff outperforming regression-based and diffusion-from-scratch baselines in perceptual fidelity, with ablations validating the necessity of each component. The approach leverages pre-trained diffusion capabilities to reduce data requirements and achieve practical improvements for low-light photography, though it trades off inference speed and depends on the base diffusion model’s strengths.

Abstract

High-quality photography in extreme low-light conditions is challenging but impactful for digital cameras. With advanced computing hardware, traditional camera image signal processor (ISP) algorithms are gradually being replaced by efficient deep networks that enhance noisy raw images more intelligently. However, existing regression-based models often minimize pixel errors and result in oversmoothing of low-light photos or deep shadows. Recent work has attempted to address this limitation by training a diffusion model from scratch, yet those models still struggle to recover sharp image details and accurate colors. We introduce a novel framework to enhance low-light raw images by retasking pre-trained generative diffusion models with the camera ISP. Extensive experiments demonstrate that our method outperforms the state-of-the-art in perceptual quality across three challenging low-light raw image benchmarks.

Paper Structure

This paper contains 16 sections, 10 equations, 18 figures, 7 tables.

Figures (18)

  • Figure 1: Comparisons of low-light raw image enhancement results. The two input raw images were captured at night with only 0.1 s and 0.033 s exposure time by a Sony A7SII camera [chen2018learning]. A digital gain of 300 and gamma correction have been applied for visualization. With sharp and vivid content, our results are comparable to the reference images captured with 300 times longer exposure on a tripod.
  • Figure 2: Overview of the proposed DarkDiff pipeline. The noisy linear RGB (LRGB) image ${\mathbf{y}}$ is processed by the encoder ${\mathcal{E}}$ to generate latent representations ${\mathbf{z}}_{\mathbf{y}}$. These representations and the Gaussian noise are fed into the Denoising U-Net, which integrates a region-based cross-attention between the noisy image and context from the pre-trained model to refine the latent variables ${\mathbf{z}}_t$. The decoder ${\mathcal{D}}$ reconstructs the final clean sRGB image ${\mathbf{x}}$. The region-based cross-attention mechanism (Sec. \ref{sec:x-atten}) allows the model to leverage contextual information at each denoising step for better detail preservation and noise reduction.
  • Figure 3: Naively using the noisy image as the conditional image in LDM fails to preserve local structures and leads to hallucinations.
  • Figure 4: VAE reconstruction results with and without our proposed residual architecture. We observe a loss in input details when not using a residual connection.
  • Figure 5: Our data processing pipeline converts Bayer raw input into a linear RGB format by applying white balance and demosaicing to the packed and amplified data. This linear RGB image is then passed to the diffusion model to produce the final sRGB image.
  • ...and 13 more figures
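
The preprocessing order described in the Figure 5 caption (pack, amplify, white balance, demosaic) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The RGGB layout, the black and white levels, the white-balance gains, and the function names `pack_bayer`, `raw_to_linear_rgb`, and `to_display` are all assumptions made for illustration, and the demosaicing step is replaced by a simple half-resolution binning that averages the two green samples. The `to_display` helper only mirrors the gamma correction that the Figure 1 caption says was applied for visualization.

```python
import numpy as np


def pack_bayer(raw, black_level=512.0, white_level=16383.0):
    """Pack an (H, W) Bayer mosaic into (H/2, W/2, 4) planes scaled to [0, 1].

    Assumes an RGGB layout and illustrative sensor levels (not from the paper).
    """
    norm = (raw.astype(np.float32) - black_level) / (white_level - black_level)
    norm = np.clip(norm, 0.0, 1.0)
    return np.stack(
        [
            norm[0::2, 0::2],  # R
            norm[0::2, 1::2],  # G on the red rows
            norm[1::2, 0::2],  # G on the blue rows
            norm[1::2, 1::2],  # B
        ],
        axis=-1,
    )


def raw_to_linear_rgb(raw, ratio, wb_gains=(2.0, 1.0, 1.5)):
    """Bayer raw -> linear RGB: pack, amplify, white balance, demosaic.

    `ratio` is the digital amplification factor. `wb_gains` are hypothetical
    per-channel (R, G, B) gains; a real pipeline would read them from the
    camera metadata.
    """
    amplified = pack_bayer(raw) * ratio
    r = amplified[..., 0] * wb_gains[0]
    # Stand-in for demosaicing: average the two green planes at half resolution.
    g = 0.5 * (amplified[..., 1] + amplified[..., 2]) * wb_gains[1]
    b = amplified[..., 3] * wb_gains[2]
    return np.clip(np.stack([r, g, b], axis=-1), 0.0, 1.0)


def to_display(linear_rgb, gamma=2.2):
    """Gamma-encode a linear image in [0, 1] for viewing only."""
    return np.clip(linear_rgb, 0.0, 1.0) ** (1.0 / gamma)
```

In the pipeline the caption describes, the linear RGB output would be the input handed to the diffusion model, while the gamma-encoded version is useful only for inspecting the noisy input by eye.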