Zero-Reference Lighting Estimation Diffusion Model for Low-Light Image Enhancement

Jinhong He, Minglong Xue, Aoxiang Ning, Chengyun Song

TL;DR

This work tackles unsupervised low-light image enhancement by removing reliance on paired data through a zero-reference diffusion framework called Zero-LED. It integrates a pluggable Initial Optimization Network to generate a structural and illumination decomposition, and performs diffusion in the wavelet low-frequency domain to reduce computation. A multi-modal Appearance Reconstruction Module (ARM) combines CLIP-based semantic guidance with frequency-domain constraints (edge and texture preservation) to steer content reconstruction and suppress artifacts. The method employs bidirectional supervisory signals and a suite of losses, achieving competitive quantitative performance and superior perceptual quality with strong generalization to real-world degradations. Overall, Zero-LED demonstrates that zero-reference diffusion training, together with frequency-domain and semantic guidance, can effectively bridge low-light and normal-light domains without paired data, enabling practical deployment.
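To illustrate why moving diffusion into the wavelet low-frequency domain can reduce computation, here is a minimal sketch of a single-level 2D wavelet decomposition. It assumes a Haar wavelet and a single-channel image with even height and width; the summary does not say which wavelet or how many levels Zero-LED uses, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical helpers rather than part of the authors' code.

```python
import numpy as np

def haar_dwt2(img):
    """Single-level 2D Haar transform of an (H, W) float array (H, W even).

    Returns the low-frequency subband and a tuple of three detail subbands,
    each of shape (H/2, W/2). The Haar wavelet is an assumption made here
    for illustration only.
    """
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0      # scaled local average: low-frequency content
    det1 = (a + b - c - d) / 2.0     # difference between top and bottom rows
    det2 = (a - b + c - d) / 2.0     # difference between left and right columns
    det3 = (a - b - c + d) / 2.0     # diagonal difference
    return low, (det1, det2, det3)

def haar_idwt2(low, details):
    """Inverse of haar_dwt2: rebuilds the full-resolution image."""
    det1, det2, det3 = details
    h, w = low.shape
    out = np.empty((2 * h, 2 * w), dtype=low.dtype)
    out[0::2, 0::2] = (low + det1 + det2 + det3) / 2.0
    out[0::2, 1::2] = (low + det1 - det2 - det3) / 2.0
    out[1::2, 0::2] = (low - det1 + det2 - det3) / 2.0
    out[1::2, 1::2] = (low - det1 - det2 + det3) / 2.0
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((256, 256))          # stand-in for a low-light image channel
    low, details = haar_dwt2(image)
    print(low.shape)                         # low-frequency subband is half size per axis
    print(np.allclose(haar_idwt2(low, details), image))
```

Under these assumptions, a model that operates only on `low` processes one quarter of the original pixels per decomposition level, while the detail subbands are kept aside so the full-resolution image can be reconstructed afterwards.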

Abstract

Diffusion-based low-light image enhancement methods rely heavily on paired training data, which limits their broader application. Meanwhile, existing unsupervised methods lack effective bridging capabilities for unknown degradations. To address these limitations, we propose Zero-LED, a novel zero-reference lighting estimation diffusion model for low-light image enhancement. It uses the stable convergence of diffusion models to bridge the gap between the low-light domain and the real normal-light domain, and alleviates the dependence on paired training data via zero-reference learning. Specifically, we first design an initial optimization network to preprocess the input image and implement bidirectional constraints between the diffusion model and the initial optimization network through multiple objective functions. Subsequently, the degradation factors of the real-world scene are optimized iteratively to achieve effective light enhancement. In addition, we explore a frequency-domain-based and semantically guided appearance reconstruction module that encourages fine-grained feature alignment of the recovered image and satisfies subjective expectations. Finally, extensive experiments demonstrate the superiority of our approach over other state-of-the-art methods, as well as stronger generalization capabilities. We will open the source code upon acceptance of the paper.

Paper Structure (14 sections, 19 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Comparison between state-of-the-art unsupervised methods and our method. The compared methods suffer from excessive noise, color distortion, and degraded visual quality.
  • Figure 2: The overall framework of the proposed Zero-LED. It adopts a bidirectional optimization approach that combines a deep neural network and a diffusion model for training without reference images. The initial optimization network provides the structural image and a preliminary optimization of unknown degradation factors for the diffusion process. The inference process further bridges the gap between degraded and normal light and is optimized by an objective function in both directions. Meanwhile, the wavelet transform effectively reduces the consumption of computational resources. The bottom part shows the pluggable initial optimization network in detail.
  • Figure 3: Framework diagram of the proposed appearance reconstruction module. Multi-modal semantics focuses on guiding illumination enhancement and supervising the input of image and text features. Frequency-domain guidance focuses on supervising high-frequency details and constraining the generation of artifacts.
  • Figure 4: Visual comparison of low-light enhancement methods on the LSRW dataset.
  • Figure 5: Visual comparison of low-light enhancement methods on the LOLv1 dataset.
  • ...and 1 more figures