ECAFormer: Low-light Image Enhancement using Cross Attention

Yudi Ruan, Hao Ma, Weikai Li, Xiao Wang

TL;DR

ECAFormer tackles LLIE by embedding dual streams of visual and semantic features into a U-shaped transformer framework. The key innovation is the Dual Multi-head Self Attention (DMSA) module, which enables cross-feature interaction, and the Cross-Scale DMSA (CSDMSA) that fuses residual and current-layer information across scales. Together with a Visual-Semantic Convolution Module and perceptual plus Charbonnier losses, the approach preserves fine details while improving global illumination, achieving competitive results on multiple benchmarks and a new Traffic-297 dataset. This cross-attention-based architecture highlights the importance of inter-component and cross-layer information exchange for robust LLIE in real-world nighttime scenes.
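For readers unfamiliar with the Charbonnier loss mentioned above, a minimal NumPy sketch of its standard form follows. It is a smooth variant of the L1 penalty, $\sqrt{(x-y)^2+\epsilon^2}$ averaged over pixels. The function name `charbonnier_loss` and the default `eps=1e-3` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Mean Charbonnier penalty sqrt(diff^2 + eps^2).

    eps is a small smoothing constant; 1e-3 is an assumed default,
    not a value reported by the paper.
    """
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    # Behaves like |diff| for large errors and stays differentiable at zero.
    return float(np.mean(np.sqrt(diff * diff + eps * eps)))

# Usage: compare an enhanced image with its reference (random stand-ins here).
rng = np.random.default_rng(0)
enhanced = rng.random((3, 8, 8))
reference = rng.random((3, 8, 8))
loss_value = charbonnier_loss(enhanced, reference)
```

For identical inputs the loss equals `eps` rather than zero, which is the price of the smoothing term.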

Abstract

Low-light image enhancement (LLIE) is critical in computer vision. Existing LLIE methods often fail to capture the underlying relationships between different sub-components, so complementary information between modules and network layers is lost, which ultimately degrades image details. To address this shortcoming, we design a hierarchical mutual Enhancement via a Cross Attention transformer (ECAFormer), an architecture that enables concurrent propagation and interaction of multiple features. The model preserves detailed information through Dual Multi-head Self-Attention (DMSA), which leverages visual and semantic features across different scales, allowing them to guide and complement each other. In addition, a Cross-Scale DMSA block is introduced to capture the residual connection, integrating cross-layer information to further enhance image detail. Experimental results show that ECAFormer reaches competitive performance across multiple benchmarks, yielding nearly a 3% improvement in PSNR over the second-best method, demonstrating the effectiveness of information interaction in LLIE.
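The idea of two feature streams guiding each other can be illustrated with a small NumPy sketch of mutual cross attention. This is a simplified, single-head illustration under stated assumptions, not the paper's DMSA implementation: the function names, the random projection weights, the token layout `(n_tokens, dim)`, and the residual addition are all hypothetical choices made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feat, context_feat, w_q, w_k, w_v):
    """Queries come from one stream; keys and values come from the other."""
    q = query_feat @ w_q
    k = context_feat @ w_k
    v = context_feat @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    return softmax(scores, axis=-1) @ v

def mutual_cross_attention(alpha, beta, params_a, params_b):
    """Update each stream using the other as context (assumed residual form)."""
    alpha_new = alpha + cross_attention(alpha, beta, *params_a)
    beta_new = beta + cross_attention(beta, alpha, *params_b)
    return alpha_new, beta_new

# Toy inputs: 16 tokens with 8 channels per stream (hypothetical sizes).
rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
alpha = rng.standard_normal((n_tokens, dim))  # stands in for visual features
beta = rng.standard_normal((n_tokens, dim))   # stands in for semantic features
params_a = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(3)]
params_b = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(3)]
alpha_new, beta_new = mutual_cross_attention(alpha, beta, params_a, params_b)
```

Both updated streams keep their input shape, so they can continue to be propagated side by side through later stages, which is the property the abstract emphasizes.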

Paper Structure (24 sections, 14 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Main architecture comparison. Compared to other methods, our approach utilizes DMSA to simultaneously propagate two features forward and facilitate interactions at different scales. This method is advantageous for extracting latent connections between features.
  • Figure 2: The ECAFormer pipeline consists of three parts: (1) Visual-Semantic Convolution modules, which output short-range features (visual features) and long-range features (semantic features). (2) The U-shaped cross-attention Transformer, which lets the long- and short-range input features interact through DMSA while propagating them concurrently. (3) Mapping convolution, which projects the interacted features back to image features in $\mathbb{R}^{C\times H\times W}$.
  • Figure 3: From top to bottom are the shallow outputs and deep outputs of the model, respectively.
  • Figure 4: DMSA, a highly symmetric module. The figure illustrates how $[\alpha',\beta']$ are computed, with each feature guiding the other through cross attention.
  • Figure 5: Three images from the SDSD-indoor test set, compared across methods. (a) Input image. (b) Ground truth. (c) ZeroDCE++. (d) SNRNet. (e) ECAFormer.
  • ...and 5 more figures