Natural Language Supervision for Low-light Image Enhancement

Jiahui Tang, Kaihua Zhou, Zhijian Luo, Yueen Hou

TL;DR

The paper addresses low-light image enhancement where traditional references vary with illumination, making a single perfect reference impractical. It introduces NaLSuper, a Natural Language Supervision network that learns image representations guided by textual descriptions through a Textual Guidance Conditioning Mechanism (TCM) and an Information Fusion Attention (IFA) module. TCM enables cross-modal and intra-modal alignment between image regions and sentence words via cross-attention, while IFA fuses multi-level image/text cues using channel, pixel, and cross-layer attention. Experiments on LOLv1, LOLv2, and SID demonstrate state-of-the-art performance in both objective metrics and perceptual quality, highlighting the robustness and practical impact of cross-modal supervision for LLIE.

Abstract

With the development of deep learning, numerous methods for low-light image enhancement (LLIE) have demonstrated remarkable performance. Mainstream LLIE methods typically learn an end-to-end mapping based on pairs of low-light and normal-light images. However, normal-light images under varying illumination conditions serve as reference images, making it difficult to define a "perfect" reference image. This leads to the challenge of reconciling metric-oriented and visual-friendly results. Recently, many cross-modal studies have found that side information from other related modalities can guide visual representation learning. Based on this, we introduce a Natural Language Supervision (NLS) strategy, which learns feature maps from text corresponding to images, offering a general and flexible interface for describing an image under different illumination. However, image distributions conditioned on textual descriptions are highly multimodal, which makes training difficult. To address this issue, we design a Textual Guidance Conditioning Mechanism (TCM) that incorporates the connections between image regions and sentence words, enhancing the ability to capture fine-grained cross-modal cues for images and text. This strategy not only utilizes a wider range of supervision sources, but also provides a new paradigm for LLIE based on visual and textual feature alignment. To effectively identify and merge features from various levels of image and textual information, we design an Information Fusion Attention (IFA) module to enhance different regions at different levels. We integrate the proposed TCM and IFA into a Natural Language Supervision network for LLIE, named NaLSuper. Finally, extensive experiments demonstrate the robustness and superior effectiveness of our proposed NaLSuper.
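
The connection between image regions and sentence words that TCM relies on is described as cross-attention. The following is a minimal sketch of that general technique, not the authors' implementation: it assumes plain scaled dot-product attention with no learned projections, and the names `cross_attention`, `regions`, and `words` as well as the toy feature vectors are hypothetical. Each image region acts as a query and attends over the word features, which serve as both keys and values.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(regions, words):
    """Scaled dot-product cross-attention (illustrative assumption only).

    regions: list of image-region feature vectors (queries).
    words:   list of word feature vectors (keys and values).
    Returns the text-conditioned region features and the attention weights.
    """
    d = len(words[0])
    attended, weights = [], []
    for q in regions:
        # Similarity of this region to every word, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in words]
        w = softmax(scores)
        # Weighted sum of word features for this region.
        mixed = [sum(w[j] * words[j][c] for j in range(len(words))) for c in range(d)]
        attended.append(mixed)
        weights.append(w)
    return attended, weights


# Toy data: three regions and two words in a 2-dimensional feature space.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[2.0, 0.0], [0.0, 2.0]]
attended, weights = cross_attention(regions, words)
```

In this toy setup the first region aligns more strongly with the first word, while the third region, which is equally similar to both words, splits its attention evenly. A real model would add learned query, key, and value projections and operate on feature maps rather than hand-written vectors.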

Paper Structure (24 sections, 10 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Comparison with state-of-the-art methods on the LOLv1 dataset. Our method restores more authentic colors and more visually appealing content.
  • Figure 2: Overall architecture of the proposed NaLSuper. NaLSuper is a Natural Language Supervision network for LLIE that incorporates the Textual Guidance Conditioning Mechanism (TCM) and Information Fusion Attention (IFA) modules. The final estimate, produced by the reconstruction part and the global residual learning structure, is taken as the desired normal-light image.
  • Figure 3: Overview architecture of Information Fusion Attention (IFA).
  • Figure 4: Overview architecture of Cross-layer Attention Fusion Block (CAFB).
  • Figure 5: Visual comparison with LLIE methods on LOLv1 dataset.
  • ...and 5 more figures