Leveraging Content and Context Cues for Low-Light Image Enhancement

Igor Morawski, Kai He, Shusil Dangi, Winston H. Hsu

TL;DR

The paper tackles the challenge of improving machine cognition in low-light conditions by enhancing images without relying on paired or unpaired normal-light data. It introduces a two-stage CLIP-based framework: (i) unsupervised image-prior learning via prompt learning and (ii) semantic-guided, zero-reference low-light enhancement that leverages content and context cues. Ablations and cross-dataset evaluations show task-based gains in image classification and in object and face detection, and indicate that restoration quality does not always correlate with cognition performance. The guidance adds computational cost only at training time, not at inference, generalizes to multiple baselines, and underscores the importance of optimizing image processing for downstream tasks rather than for human perceptual quality alone.

Abstract

Low-light conditions have an adverse impact on machine cognition, limiting the performance of computer vision systems in real life. Since low-light data is limited and difficult to annotate, we focus on image processing to enhance low-light images and improve the performance of any downstream task model, instead of fine-tuning each model, which can be prohibitively expensive. We propose to improve existing zero-reference low-light enhancement by leveraging the CLIP model to capture an image prior and to provide semantic guidance. Specifically, we propose a data augmentation strategy, based on image sampling, that learns an image prior via prompt learning without any need for paired or unpaired normal-light data. Next, we propose a semantic guidance strategy that takes maximal advantage of existing low-light annotations by introducing both content and context cues about the image training patches. We show experimentally, in a qualitative study, that the proposed prior and semantic guidance help to improve the overall image contrast and hue, as well as background-foreground discrimination, resulting in reduced over-saturation and noise over-amplification, both common in related zero-reference methods. As we target machine cognition, rather than assuming a correlation between human perception and downstream task performance, we present an ablation study and a comparison with related zero-reference methods in terms of task-based performance across many low-light datasets, including image classification and object and face detection, showing the effectiveness of our proposed method.

Paper Structure

This paper contains 19 sections, 10 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: As we target machine cognition, rather than rely on assuming the correlation between human perception and downstream task performance, we conduct and present an ablation study and comparison with related zero-reference methods in terms of task-based performance across many low-light datasets, including image classification, object and face detection, showing the effectiveness of our proposed method.
  • Figure 2: Our proposed method. In the first stage, we propose to learn the positive and negative image priors without any need for paired or unpaired normal-light data. Next, we train the image enhancement model using zero-reference image losses, image prior prompts and semantic guidance. We propose to make maximal use of existing image annotation by using cues about both the content and context of an image patch, that is, about the objects within and outside the patch. The two guidance steps work synergistically to improve the performance and, at the same time, increase computational complexity only at training time, without incurring any additional costs during inference. Our proposed method leverages the CLIP model and its zero-shot capabilities to scale favorably to include various datasets with any limitations on annotated object categories, instead of fixing the training category set.
  • Figure 3: Since our motivation is to make maximal use of existing low-light annotation, which may be difficult or costly to obtain, we apply two separate steps which use different information for semantic guidance. Instead of realizing semantic guidance by training on object instances and corresponding labels, we propose to train on patches sampled from the image, together with descriptions of the scene within and outside the patch. In the first step, within a training batch, we match each image description to its image, and in the second step, similarly, we match descriptions of objects extending outside of the image to each image.
  • Figure 4: We use image sampling to augment the positive and negative image prompt pair. The positive sample (on the left) is augmented using $4 \times 4$ averaging, acting as a fast and simple proxy for denoising, and the negative sample (on the right) is augmented using $1:4$ subsampling, preserving the noise in the image. Later, we use the learned prompt pair to help guide the enhancement model, leading to improved image contrast, reduced under- and overexposure, and reduced over-amplification of noise.
  • Figure 5: We first use the CLIP model to learn the positive and negative image prior pair using a simple data augmentation strategy, based on image resampling, eliminating any need for paired or unpaired normal-light data. The learned prompt pair is later used for guiding the enhancement model. We experimentally show that the proposed prompts help to guide the image enhancement model by improving the overall image contrast, reducing under- and overexposure, which leads to decreased information loss, and reducing the over-amplification of noise common in unsupervised enhancement models.
  • ...and 7 more figures
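
The resampling described in the Figure 4 caption can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function names `average_pool_4x4` and `subsample_1_in_4` are hypothetical, and the sketch assumes that the $4 \times 4$ averaging uses non-overlapping blocks and that "$1:4$ subsampling" means keeping every fourth pixel along each axis. Under those assumptions, both samples have the same resolution, but only the positive one has its per-pixel noise averaged down.

```python
import numpy as np


def average_pool_4x4(image: np.ndarray) -> np.ndarray:
    """Positive sample (assumed form): mean over non-overlapping 4x4 blocks.

    Averaging 16 pixels suppresses independent per-pixel noise, which is why
    it can act as a cheap proxy for denoising.
    """
    h, w, c = image.shape
    h4, w4 = h - h % 4, w - w % 4  # crop so both sides are divisible by 4
    blocks = image[:h4, :w4].reshape(h4 // 4, 4, w4 // 4, 4, c)
    return blocks.mean(axis=(1, 3))


def subsample_1_in_4(image: np.ndarray) -> np.ndarray:
    """Negative sample (assumed form): keep every 4th pixel along each axis.

    No averaging takes place, so the noise level of the input is preserved.
    """
    h, w, _ = image.shape
    h4, w4 = h - h % 4, w - w % 4  # same crop as the positive sample
    return image[:h4:4, :w4:4]


if __name__ == "__main__":
    # Synthetic stand-in for a dark, noisy image (values are illustrative).
    rng = np.random.default_rng(0)
    noisy = 0.2 + 0.05 * rng.standard_normal((64, 64, 3))

    positive = average_pool_4x4(noisy)
    negative = subsample_1_in_4(noisy)

    print("shapes:", positive.shape, negative.shape)
    print("std of positive sample:", positive.std())
    print("std of negative sample:", negative.std())
```

Running the script prints the two shapes and standard deviations, so the difference in residual noise between the two samples can be checked directly. In the pipeline the captions describe, such pairs would then be used to learn the positive and negative prompts with CLIP; that step is omitted here.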