Low-light Image Enhancement via CLIP-Fourier Guided Wavelet Diffusion

Minglong Xue, Jinhong He, Wenhai Wang, Mingliang Zhou

TL;DR

This work tackles unstable and visually unsatisfactory low-light image enhancement by proposing CFWD, a diffusion-based method guided by multimodal CLIP semantics in a frequency-domain wavelet space. It combines a Wavelet Diffusion Model with a Multiscale Visual-Language Guidance Network and a High Frequency Perception Module to constrain content diversity and preserve fine details, using a composite loss that includes diffusion, spectral, and content terms. The approach yields state-of-the-art quantitative gains and superior perceptual quality across diverse real-world benchmarks, including high-resolution backlit scenes, while maintaining generalization to unseen conditions. The framework demonstrates the practical impact of fusing multimodal semantics with spectral-domain diffusion for robust, perceptually faithful low-light enhancement.

Abstract

Low-light image enhancement techniques have progressed significantly, but unstable image quality recovery and unsatisfactory visual perception remain significant challenges. To solve these problems, we propose a novel and robust low-light image enhancement method via CLIP-Fourier Guided Wavelet Diffusion, abbreviated as CFWD. Specifically, CFWD leverages multimodal visual-language information in the frequency-domain space created by multiple wavelet transforms to guide the enhancement process. Multi-scale supervision across different modalities facilitates the alignment of image features with semantic features during the wavelet diffusion process, effectively bridging the gap between the degraded and normal domains. Moreover, to further promote the effective recovery of image details, we build on the wavelet transform by incorporating the Fourier transform and construct a Hybrid High Frequency Perception Module (HFPM) that is highly sensitive to detailed features. This module avoids the diversity confusion of the wavelet diffusion process by guiding the fine-grained structure recovery of the enhancement results, achieving enhancement that is favourable in terms of both metrics and perception. Extensive quantitative and qualitative experiments on publicly available real-world benchmarks show that our approach outperforms existing state-of-the-art methods, achieving significant progress in image quality and noise suppression. The project code is available at https://github.com/hejh8/CFWD.
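
To make the "frequency domain space created by multiple wavelet transforms" more concrete, the following is a minimal sketch of a repeated 2D wavelet decomposition and its inverse. It assumes a Haar filter, a list-of-lists grayscale image with even side lengths, and the hypothetical helper names `haar_dwt2`, `haar_idwt2`, `multilevel_dwt`, and `multilevel_idwt`; none of these are taken from the CFWD code, and the sub-band naming convention can differ between libraries.

```python
def haar_dwt2(img):
    """One level of a 2D Haar DWT (assumed filter; illustration only).

    Returns (A, H, V, D): the low-frequency approximation and the
    horizontal, vertical, and diagonal high-frequency sub-bands,
    each half the height and width of the input.
    """
    A, H, V, D = [], [], [], []
    for i in range(0, len(img), 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for j in range(0, len(img[0]), 2):
            # The four pixels of one 2x2 block.
            p, q = img[i][j], img[i][j + 1]
            r, s = img[i + 1][j], img[i + 1][j + 1]
            a_row.append((p + q + r + s) / 2)  # block average (scaled)
            h_row.append((p + q - r - s) / 2)  # top rows minus bottom rows
            v_row.append((p - q + r - s) / 2)  # left columns minus right columns
            d_row.append((p - q - r + s) / 2)  # diagonal difference
        A.append(a_row)
        H.append(h_row)
        V.append(v_row)
        D.append(d_row)
    return A, H, V, D


def haar_idwt2(A, H, V, D):
    """Inverse of haar_dwt2: rebuild each 2x2 block from its four coefficients."""
    out = []
    for a_row, h_row, v_row, d_row in zip(A, H, V, D):
        top, bottom = [], []
        for a, h, v, d in zip(a_row, h_row, v_row, d_row):
            top += [(a + h + v + d) / 2, (a + h - v - d) / 2]
            bottom += [(a - h + v - d) / 2, (a - h - v + d) / 2]
        out += [top, bottom]
    return out


def multilevel_dwt(img, k):
    """Apply the transform k times, always re-decomposing the low-frequency band."""
    low, details = img, []
    for _ in range(k):
        low, h, v, d = haar_dwt2(low)
        details.append((h, v, d))
    return low, details


def multilevel_idwt(low, details):
    """Undo multilevel_dwt, starting from the coarsest level."""
    for h, v, d in reversed(details):
        low = haar_idwt2(low, h, v, d)
    return low
```

In a pipeline of the kind the abstract describes, the small low-frequency band returned by `multilevel_dwt` would be the input to the diffusion process, while the high-frequency sub-bands in `details` would be handled separately before reconstruction. The sketch only shows that the decomposition is lossless and shrinks the spatial size by a factor of two per level.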

Paper Structure

This paper contains 17 sections, 18 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Visual comparison of our method with recent state-of-the-art methods. Other methods suffer from contrast degradation and noise artifacts, whereas our method achieves the best visual perception.
  • Figure 2: Representative visual examples of low-light images enhanced with CFWD. All of these images have either 2K or 4K resolution.
  • Figure 3: The overall workflow of our proposed CFWD. The low-light input $I_L$ and the normal-light image $I_H$ are first transformed to the wavelet low-frequency domain $(A)$ via the K-discrete wavelet transform (K-DWT) for diffusion inference. A multiscale visual guidance network is embedded to iteratively apply appearance guidance and content constraints by combining multiple wavelet domains during inference. In addition, the three decomposed high-frequency components $\{V_L, H_L, D_L\}$ are augmented by a high-frequency perception module (HFPM). Finally, the enhancement result $I_E$ is obtained by the inverse discrete wavelet transform (K-IDWT).
  • Figure 4: Detailed architecture of our proposed High Frequency Perception Module (HFPM). DS Conv denotes depth-wise separable convolution, and DFT denotes Discrete Fourier Transform.
  • Figure 5: The multiscale visual-language guidance network gradually aligns image features with the positive prompts $T_p$ while continuously moving them away from the negative prompts $T_n$. Stage 1 shows the result without visual-language guidance.
  • ...and 3 more figures
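
The prompt-based guidance described in the caption of Figure 5 can be illustrated with a small sketch. It is not the loss used in CFWD: it assumes a softmax over cosine similarities between an image embedding and one positive and one negative prompt embedding, with a temperature of 0.07, and the function names `cosine` and `prompt_guidance_loss` are hypothetical. The embeddings here are plain lists of floats standing in for CLIP features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prompt_guidance_loss(image_emb, pos_emb, neg_emb, temperature=0.07):
    """Negative log-probability that the image matches the positive prompt.

    Assumed form (illustration only): softmax over the two scaled cosine
    similarities. The loss is small when the image embedding is close to
    the positive prompt T_p and large when it is close to the negative
    prompt T_n.
    """
    s_pos = cosine(image_emb, pos_emb) / temperature
    s_neg = cosine(image_emb, neg_emb) / temperature
    # Numerically stable log-sum-exp over the two logits.
    m = max(s_pos, s_neg)
    log_z = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_z - s_pos
```

Minimizing a loss of this shape pulls the image embedding toward $T_p$ and pushes it away from $T_n$ at the same time, which is the behaviour the caption attributes to the guidance network. An embedding that is equally similar to both prompts gives a loss of $\ln 2$.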