Zero-Reference Lighting Estimation Diffusion Model for Low-Light Image Enhancement
Jinhong He, Minglong Xue, Aoxiang Ning, Chengyun Song
TL;DR
This work tackles unsupervised low-light image enhancement by removing reliance on paired data through a zero-reference diffusion framework called Zero-LED. It integrates a pluggable Initial Optimization Network to generate a structural and illumination decomposition, and performs diffusion in the wavelet low-frequency domain to reduce computation. A multi-modal Appearance Reconstruction Module (ARM) combines CLIP-based semantic guidance with frequency-domain constraints (edge and texture preservation) to steer content reconstruction and suppress artifacts. The method employs bidirectional supervisory signals and a suite of losses, achieving competitive quantitative performance and superior perceptual quality with strong generalization to real-world degradations. Overall, Zero-LED demonstrates that zero-reference diffusion training, together with frequency-domain and semantic guidance, can effectively bridge low-light and normal-light domains without paired data, enabling practical deployment.
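To make the computational argument concrete, the following is a minimal sketch of a single-level 2D wavelet decomposition. The summary above does not say which wavelet Zero-LED uses, so the Haar basis, the single decomposition level, and the function names `haar_dwt2` and `haar_idwt2` are illustrative assumptions rather than the authors' implementation. The sketch only shows why working on the low-frequency band is cheaper: that band has half the height and half the width of the input, so a model operating on it sees a quarter of the pixels.

```python
import numpy as np


def haar_dwt2(img):
    """Single-level 2D Haar transform of an (H, W) array with even H and W.

    Assumption: Haar is used purely for illustration; the source does not
    name the wavelet. Returns the low-frequency band and three detail bands.
    """
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0        # low-frequency band (scaled block sum)
    col_diff = (a - b + c - d) / 2.0   # difference across columns
    row_diff = (a + b - c - d) / 2.0   # difference across rows
    diag = (a - b - c + d) / 2.0       # diagonal difference
    return low, (col_diff, row_diff, diag)


def haar_idwt2(low, details):
    """Invert haar_dwt2 exactly, rebuilding the (2H, 2W) array."""
    col_diff, row_diff, diag = details
    h, w = low.shape
    out = np.empty((2 * h, 2 * w), dtype=low.dtype)
    out[0::2, 0::2] = (low + col_diff + row_diff + diag) / 2.0
    out[0::2, 1::2] = (low - col_diff + row_diff - diag) / 2.0
    out[1::2, 0::2] = (low + col_diff - row_diff - diag) / 2.0
    out[1::2, 1::2] = (low - col_diff - row_diff + diag) / 2.0
    return out
```

In a pipeline of this kind, an enhancement model would process only `low` and the result would be recombined with the detail bands through the inverse transform. How Zero-LED treats the high-frequency bands is not specified in this summary.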
Abstract
Diffusion model-based low-light image enhancement methods rely heavily on paired training data, which limits their broad application. Meanwhile, existing unsupervised methods lack effective bridging capabilities for unknown degradations. To address these limitations, we propose Zero-LED, a novel zero-reference lighting estimation diffusion model for low-light image enhancement. It exploits the stable convergence of diffusion models to bridge the gap between the low-light domain and the real normal-light domain, and alleviates the dependence on paired training data via zero-reference learning. Specifically, we first design an initial optimization network to preprocess the input image, and implement bidirectional constraints between the diffusion model and the initial optimization network through multiple objective functions. Subsequently, the degradation factors of the real-world scene are optimized iteratively to achieve effective light enhancement. In addition, we explore a frequency-domain-based, semantically guided appearance reconstruction module that encourages fine-grained feature alignment of the recovered image and satisfies subjective expectations. Finally, extensive experiments demonstrate the superiority of our approach over other state-of-the-art methods, as well as its stronger generalization capabilities. We will release the source code upon acceptance of the paper.
