
Unsupervised Image Prior via Prompt Learning and CLIP Semantic Guidance for Low-Light Image Enhancement

Igor Morawski, Kai He, Shusil Dangi, Winston H. Hsu

TL;DR

The paper tackles the challenge of robust machine cognition in low-light conditions without relying on paired or unpaired normal-light data. It introduces a two-stage training approach that uses learnable prompts to construct an image prior and CLIP-based semantic guidance to optimize enhancement for downstream tasks. The method combines a lightweight, zero-reference enhancement baseline with prompt-based priors and open-vocabulary CLIP supervision, achieving consistent gains across diverse datasets. Because the semantic guidance relies on open-vocabulary classification, the approach extends to arbitrary object categories and to any annotation type from which object patches can be extracted, improving task-based performance with minimal extra computation.

Abstract

Currently, low-light conditions present a significant challenge for machine cognition. In this paper, rather than optimizing models by assuming that human and machine cognition are correlated, we use zero-reference low-light enhancement to improve the performance of downstream task models. We propose to improve the zero-reference low-light enhancement method by leveraging the rich visual-linguistic CLIP prior without any need for paired or unpaired normal-light data, which is laborious and difficult to collect. We propose a simple but effective strategy to learn prompts that help guide the enhancement method, and we show experimentally that prompts learned without any normal-light data improve image contrast, reduce over-enhancement, and reduce noise over-amplification. Next, we propose to reuse the CLIP model for semantic guidance via zero-shot open-vocabulary classification to optimize low-light enhancement for task-based performance rather than human visual perception. We conduct extensive experiments showing that the proposed method leads to consistent improvements in task-based performance across various datasets, and we compare our method against state-of-the-art methods, showing favorable results across various low-light datasets.
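As a rough illustration of the zero-shot open-vocabulary classification mentioned in the abstract, the sketch below scores one image embedding against a set of text-prompt embeddings using cosine similarity followed by a softmax. The embeddings are toy three-dimensional vectors standing in for CLIP outputs, and the names `zero_shot_probs`, `temperature`, and the cross-entropy guidance term are assumptions made for this example; this section does not specify the paper's actual loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_probs(image_emb, text_embs, temperature=0.1):
    """Softmax over temperature-scaled cosine similarities.

    `text_embs` maps a class prompt to its embedding. The temperature
    value is an assumption chosen for this toy example.
    """
    logits = [cosine(image_emb, t) / temperature for t in text_embs.values()]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(text_embs, exps)}


# Hypothetical embeddings; a real pipeline would obtain these from the
# CLIP image and text encoders.
text_embs = {
    "a photo of a car": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a bicycle": [0.1, 0.2, 1.0],
}
patch_emb = [0.9, 0.1, 0.2]  # embedding of an enhanced object patch

probs = zero_shot_probs(patch_emb, text_embs)
predicted = max(probs, key=probs.get)

# Assumed guidance term: cross-entropy against the annotated category.
loss = -math.log(probs["a photo of a car"])
print(predicted, loss)
```

Because the class set is just a dictionary of prompts, adding a category only requires adding another prompt, which is the property that makes the guidance open-vocabulary.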

Paper Structure (14 sections, 10 equations, 6 figures, 2 tables)


Figures (6)

  • Figure 1: Our proposed method leverages the CLIP model [radford2021learning] for an unsupervised image prior via prompt learning and for open-vocabulary semantic guidance. It improves the overall image hue, reduces over-enhancement, and reduces noise over-amplification. Further, we conduct extensive experiments to show that our method significantly improves machine cognition, as measured by the task-based performance of downstream task models, without incurring any additional computation cost on the lightweight enhancement baseline model [guo2020zero].
  • Figure 2: Our proposed two-stage training process leverages the pre-trained CLIP model, which can capture the lighting conditions and quality of images. We use the CLIP model to learn positive and negative image priors via prompt learning, with a simple data augmentation strategy and without any need for paired or unpaired normal-light data, and use these priors to guide the image enhancement model. During training, we use the learned prompts and reuse the CLIP model for semantic guidance to improve the quality of the enhanced images. Because our method uses open-vocabulary classification, it can be easily extended, without limiting object categories, to any dataset with annotated bounding boxes or any other type of annotation from which patches with an object category can be extracted, as well as to paired low- and normal-light datasets, increasing the variety of the training data.
  • Figure 3: Statistics of the datasets used for the ablation study.
  • Figure 4: Our strategy for learning positive and negative image prompts: a $1:4$ subsampled negative image (left) and a $4 \times 4$ averaged positive image (right). Subsampling preserves the noise in the image, while averaging acts as a fast and simple proxy for denoising.
  • Figure 5: Ablation study of our proposed method. Semantic guidance improves the color distribution of the images, while the learned prompt improves image contrast, reduces overexposure and reduces over-amplification of the noise.
  • ...and 1 more figures
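
The augmentation described in the Figure 4 caption can be sketched as follows. This is a minimal pure-Python illustration on a single-channel image stored as nested lists; the function names `subsample` and `block_average`, the synthetic noisy image, and the cropping to a multiple of the factor are assumptions for this example, and the paper's own implementation may differ.

```python
import random
import statistics


def _crop(img, factor):
    """Crop so height and width are multiples of `factor` (an assumption)."""
    h = len(img) - len(img) % factor
    w = len(img[0]) - len(img[0]) % factor
    return [row[:w] for row in img[:h]]


def subsample(img, factor=4):
    """Negative sample: keep one pixel per factor x factor block (noise kept)."""
    img = _crop(img, factor)
    return [row[::factor] for row in img[::factor]]


def block_average(img, factor=4):
    """Positive sample: average each factor x factor block (denoising proxy)."""
    img = _crop(img, factor)
    out = []
    for y in range(0, len(img), factor):
        out_row = []
        for x in range(0, len(img[0]), factor):
            block = [img[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# Synthetic flat gray image with additive Gaussian noise (illustrative only).
rng = random.Random(0)
noisy = [[0.5 + rng.gauss(0.0, 0.1) for _ in range(64)] for _ in range(64)]

neg = subsample(noisy)      # same resolution as `pos`, noise preserved
pos = block_average(noisy)  # same resolution as `neg`, noise suppressed

var_neg = statistics.pvariance([p for row in neg for p in row])
var_pos = statistics.pvariance([p for row in pos for p in row])
print(var_neg, var_pos)
```

Both outputs have the same resolution, so the pair differs mainly in noise level: averaging 16 independent noisy pixels reduces the noise variance by roughly a factor of 16, whereas subsampling leaves it unchanged.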