DAP-LED: Learning Degradation-Aware Priors with CLIP for Joint Low-light Enhancement and Deblurring

Ling Wang, Chen Wu, Lin Wang

TL;DR

DAP-LED introduces a degradation-aware, CLIP-guided framework for jointly solving low-light enhancement and deblurring with a $4$-level transformer encoder-decoder. It employs a CLIP-guided cross-fusion module to generate multi-scale degradation heatmaps and CLIP-enhanced transformer blocks to fuse degradation information into restoration, optimized with a Charbonnier reconstruction loss plus CLIP-based priors ($L_{identity}$ and $L_{clip}$). The method achieves state-of-the-art restoration on the LOL-Blur suite and real night-blurred images, and demonstrably improves downstream tasks such as depth estimation, segmentation, and object detection. This work highlights the practical utility of CLIP priors for degradation-aware, joint restoration in nighttime robotics and autonomous systems.
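The Charbonnier reconstruction term mentioned above is a smooth variant of the L1 loss. As a minimal sketch in NumPy, it could look like the following. The function name `charbonnier_loss` and the default `eps=1e-3` are assumptions for illustration and are not taken from the paper. The CLIP-based terms $L_{identity}$ and $L_{clip}$ and their weights are not shown, since this summary does not define them.

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Mean of sqrt(diff^2 + eps^2) over all elements.

    eps is a hypothetical smoothing constant (assumed here, not from the paper);
    it keeps the penalty differentiable where pred == target.
    """
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    return float(np.mean(np.sqrt(diff * diff + eps * eps)))

# Toy usage: a restored image versus its ground truth (random stand-ins).
rng = np.random.default_rng(0)
restored = rng.uniform(size=(3, 8, 8))
ground_truth = rng.uniform(size=(3, 8, 8))
loss = charbonnier_loss(restored, ground_truth)
```

For large residuals the penalty behaves like an absolute error, while near zero it is bounded below by `eps`, which is why it is often preferred to plain L1 in restoration training.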

Abstract

Autonomous vehicles and robots often struggle with reliable visual perception at night because of low illumination and the motion blur caused by the long exposure times of RGB cameras. Existing methods address this challenge by sequentially chaining off-the-shelf pretrained low-light enhancement and deblurring models. Unfortunately, these pipelines often produce noticeable artifacts (e.g., color distortions) in over-exposed regions or make it nearly impossible to learn motion cues in dark regions. In this paper, we find that vision-language models, e.g., Contrastive Language-Image Pretraining (CLIP), can comprehensively perceive diverse degradation levels at night. In light of this, we propose a novel transformer-based joint learning framework, named DAP-LED, which jointly performs low-light enhancement and deblurring and benefits downstream tasks such as depth estimation, segmentation, and detection in the dark. The key insight is to leverage CLIP to adaptively learn the degradation levels of nighttime images, which enables learning rich semantic information and visual representations for optimizing the joint tasks. To achieve this, we first introduce a CLIP-guided cross-fusion module that obtains multi-scale patch-wise degradation heatmaps from the image embeddings. The heatmaps are then fused via the designed CLIP-enhanced transformer blocks to retain useful degradation information for effective model optimization. Experimental results show that, compared to existing methods, DAP-LED achieves state-of-the-art performance in the dark. Moreover, the enhanced results are shown to be effective for three downstream tasks. For a demo and more results, please see the project page: https://vlislab22.github.io/dap-led/.

Paper Structure (14 sections, 5 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Comparison of depth estimation (Depth Anything V2) and object detection (DETR) results with night-blurred input (top) and our enhanced image as input (bottom).
  • Figure 2: Our motivation. Similarity scores between images and text descriptions, derived from zero-shot inference with the CLIP-L/14 model. Despite some misclassifications in single degradation categories (e.g., the high score for "bright" in a low-light image), CLIP demonstrates stronger joint degradation perception: it identifies the combined low-light and blur degradation in multiple images.
  • Figure 3: Overview of our proposed DAP-LED framework. The framework uses a 4-level transformer-based symmetric encoder-decoder structure and incorporates multiple CLIP-enhanced Transformer Blocks (CeTBs) at each decoding level. At the beginning, the CLIP-guided Cross-fusion Module (CCM) generates the multi-scale degradation-aware heatmaps, which are used at each level of the encoding and decoding stages.
  • Figure 4: Overview of CCM. Here is a visualization of the patch-wise similarity weights between the image and text embeddings. The heatmap on the bottom right shows the model’s joint perception of low-light and blur, as well as single perceptions of low-light and blur individually.
  • Figure 5: Visual quality comparison on the LOL-Blur dataset. Please check the zoom-in patches to observe the details.
  • ...and 3 more figures
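
The captions of Figures 2 and 4 describe patch-wise similarity weights between image and text embeddings. The sketch below illustrates that general idea only and is not the paper's CCM. It assumes per-patch image embeddings and per-prompt text embeddings are already available (random arrays stand in for CLIP outputs here), and the function name `patch_text_heatmaps`, the number of prompts, and the `temperature` value are all hypothetical.

```python
import numpy as np

def patch_text_heatmaps(patch_emb, text_emb, grid_hw, temperature=0.01):
    """Return one heatmap per text prompt.

    patch_emb: (N, D) embeddings for N = H * W image patches (row-major order).
    text_emb:  (T, D) embeddings for T degradation prompts.
    Output:    (T, H, W) weights; for each patch they sum to 1 across prompts.
    """
    h, w = grid_hw
    # L2-normalize so the dot product equals cosine similarity.
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (p @ t.T) / temperature              # (N, T)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True) # softmax over prompts
    return weights.T.reshape(text_emb.shape[0], h, w)

# Stand-in data: a 4x4 patch grid and 3 prompts (e.g. low-light, blur, both).
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))
prompts = rng.normal(size=(3, 8))
maps = patch_text_heatmaps(patches, prompts, (4, 4))
print(maps.shape)  # (3, 4, 4)
```

A multi-scale variant could resize these maps to each encoder-decoder level, but how DAP-LED does this is not specified in this summary.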