VL-UR: Vision-Language-guided Universal Restoration of Images Degraded by Adverse Weather Conditions

Ziyan Liu, Yuxu Lu, Huashan Yu, Dong Yang

TL;DR

This work introduces VL-UR, a universal image restoration framework that combines CLIP-based vision-language priors with a degradation-aware scene classifier to restore images degraded by diverse weather conditions. The system pairs a pretrained Scene Classifier (SC) built on a frozen CLIP with a Transformer-based Scene Restorer (SR), and uses Cross-Transformer Aggregation (CTransAgg) with a prompt-guided attention mechanism (PGCA) to fuse semantic text and image cues across eleven degradation types. A hybrid loss combining Smooth L1, MS-SSIM, and a CDRL-like feature separation term drives optimization at the pixel, structural, and feature levels, achieving state-of-the-art results on the CDD-11 dataset with efficient, near real-time performance. The approach enables robust, adaptive restoration for real-world applications such as autonomous driving and surveillance, where degradations are often complex and non-uniform.
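The hybrid loss described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' formulation: the ratio form of the feature separation term, the weights `w_pix`, `w_struct`, and `w_feat`, the use of raw vectors in place of learned features, and the hypothetical `structural_term` callable that stands in for an MS-SSIM loss.

```python
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def smooth_l1(pred: Vector, target: Vector, beta: float = 1.0) -> float:
    """Mean Smooth L1 loss: quadratic below beta, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)


def l1_distance(a: Vector, b: Vector) -> float:
    """Mean absolute difference between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def feature_separation(out_feat: Vector, pos_feat: Vector,
                       neg_feats: Sequence[Vector], eps: float = 1e-7) -> float:
    """Assumed contrastive ratio: small when the output is close to the
    positive sample and far from the negative samples."""
    if not neg_feats:
        raise ValueError("at least one negative sample is required")
    pull = l1_distance(out_feat, pos_feat)
    push = sum(l1_distance(out_feat, n) for n in neg_feats) / len(neg_feats)
    return pull / (push + eps)


def hybrid_loss(out: Vector, pos: Vector, negs: Sequence[Vector],
                structural_term: Optional[Callable[[Vector, Vector], float]] = None,
                w_pix: float = 1.0, w_struct: float = 1.0,
                w_feat: float = 0.1) -> float:
    """Weighted sum of pixel, structural, and feature-level terms.
    The weights are hypothetical, not values from the paper."""
    loss = w_pix * smooth_l1(out, pos)
    loss += w_feat * feature_separation(out, pos, negs)
    if structural_term is not None:  # e.g. an MS-SSIM loss supplied by the caller
        loss += w_struct * structural_term(out, pos)
    return loss
```

In this sketch a perfectly restored output yields a loss of zero, while an output that drifts toward a negative sample is penalized by both the pixel term and the separation term.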

Abstract

Image restoration is critical for improving the quality of degraded images, which is vital for applications like autonomous driving, security surveillance, and digital content enhancement. However, existing methods are often tailored to specific degradation scenarios, limiting their adaptability to the diverse and complex challenges in real-world environments. Moreover, real-world degradations are typically non-uniform, highlighting the need for adaptive and intelligent solutions. To address these issues, we propose a novel vision-language-guided universal restoration (VL-UR) framework. VL-UR leverages a zero-shot contrastive language-image pre-training (CLIP) model to enhance image restoration by integrating visual and semantic information. A scene classifier is introduced to adapt CLIP, generating high-quality language embeddings aligned with degraded images while predicting degradation types for complex scenarios. Extensive experiments across eleven diverse degradation settings demonstrate VL-UR's state-of-the-art performance, robustness, and adaptability. This positions VL-UR as a transformative solution for modern image restoration challenges in dynamic, real-world environments.
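As a rough illustration of how a CLIP-style model can predict a degradation type without task-specific labels, the sketch below scores one image embedding against one text embedding per degradation prompt using cosine similarity and a softmax. It is not the adapted scene classifier from the paper. The toy three-dimensional embeddings, the labels `haze`, `rain`, and `snow`, and the `scale` value are all hypothetical; in practice the embeddings would come from the CLIP image and text encoders.

```python
import math
from typing import Dict, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_degradation(image_emb: Vector, text_embs: Dict[str, Vector],
                         scale: float = 100.0) -> Tuple[str, Dict[str, float]]:
    """Return the best-matching label and the probability for every label.
    `scale` is an assumed temperature applied to the similarities."""
    labels = list(text_embs)
    logits = [scale * cosine(image_emb, text_embs[name]) for name in labels]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))


# Toy embeddings standing in for encoder outputs (illustrative only).
prompts = {"haze": [1.0, 0.0, 0.0], "rain": [0.0, 1.0, 0.0], "snow": [0.0, 0.0, 1.0]}
label, probabilities = classify_degradation([0.1, 0.9, 0.2], prompts)
```

The predicted label could then select or weight the text embedding that guides restoration, which is the role the abstract assigns to the scene classifier.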

Paper Structure

This paper contains 27 sections, 11 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Examples of scene recovery under four different imaging conditions are presented in (a)-(d). The upper triangles in each sub-figure represent the degraded patterns, while the lower triangles display the corresponding restored patterns produced by our method.
  • Figure 2: Overview of our method. VL-UR mainly consists of two parts: a pretrained degradation Scene Classifier (SC) built on the frozen CLIP, and a Scene Restorer (SR) network guided by SC. The specific structures of SC and SR are shown on the right side. SC is pretrained and then frozen. During the training phase of SR, SC extracts rich semantic information $\mathbf{F}_{\text{text}}$ from the captions corresponding to the images, and $\mathbf{F}_{\text{text}}$ is then passed to the PGCA in CTransAgg for restoration. In the figure, $\mathbf{I}_{\text{in}}$, $\mathbf{I}_{\text{out}}$, $\mathbf{I}_{\text{pos}}$, and $\mathbf{I}_{\text{neg}}$ denote the input image, the restored image, the corresponding positive sample, and a negative sample of another degradation type, respectively. These images are later used to calculate the loss function.
  • Figure 3: The composition and inference process of the zero-shot degradation Scene Classifier.
  • Figure 4: The structure of the CTransAgg module, which consists of the PGCA and an FFN; the PGCA contains the cross attention and LNorm.
  • Figure 5: Visual comparisons of multi-scene restoration on CDD-11 [guo2024onerestore]. (a) Low-visibility input images, followed by restored images generated by (b) MIRNet [zamir2022learning], (c) Fourmer [zhou2023fourmer], (d) OKNet [cui2024omni], (e) AirNet [li2022all], (f) TransW [valanarasu2022transweather], (g) Diffusion [ozdenizci2023restoring], (h) Prompt [potlapalli2024promptir], (i) WGWS [zhu2023learning], (j) OneRestore [guo2024onerestore], (k) our VL-UR, and (l) Ground Truth. Zoom in for a more precise visual comparison.
  • ...and 1 more figures
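
The Figure 4 caption describes the PGCA as cross attention followed by LNorm. A minimal sketch of that pattern is given below under several assumptions that the captions do not confirm: image tokens act as queries, text features act as keys and values, the learned projection matrices are replaced by identity mappings, a residual connection precedes the normalization, and the FFN is omitted. The function name `prompt_guided_cross_attention` is hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def softmax_rows(m: Matrix) -> Matrix:
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def layer_norm(m: Matrix, eps: float = 1e-5) -> Matrix:
    """Normalize each token to zero mean and unit variance (no learned scale)."""
    out = []
    for row in m:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def prompt_guided_cross_attention(img_tokens: Matrix,
                                  text_tokens: Matrix) -> Tuple[Matrix, Matrix]:
    """Image tokens attend to text tokens; returns (normalized output, weights)."""
    dim = len(img_tokens[0])
    scores = matmul(img_tokens, transpose(text_tokens))
    scores = [[s / math.sqrt(dim) for s in row] for row in scores]
    weights = softmax_rows(scores)            # one distribution per image token
    attended = matmul(weights, text_tokens)   # text information per image token
    residual = [[x + y for x, y in zip(r_img, r_att)]
                for r_img, r_att in zip(img_tokens, attended)]
    return layer_norm(residual), weights


# Two image tokens and three text tokens, all of dimension 4 (toy values).
image = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 0.0, 3.0]]
text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
fused, attn = prompt_guided_cross_attention(image, text)
```

Each row of `attn` shows how strongly one image token draws on each text token, which is one way semantic cues from $\mathbf{F}_{\text{text}}$ could steer the restoration features.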