Vision-Language Gradient Descent-driven All-in-One Deep Unfolding Networks

Haijin Zeng, Xiangming Wang, Yongyong Chen, Jingyong Su, Jie Liu

TL;DR

This work introduces VLU-Net, the first all-in-one deep unfolding network (DUN) for multi-degradation image restoration (IR). It uses a vision-language model (VLM) to automatically identify degradation types and align them with image features. By fine-tuning CLIP on degraded image-text pairs and integrating a degradation-guided gradient descent module (D-GDM) within a hierarchical, feature-level DUN, the approach handles noise, blur, rain, haze, and low-light distortions in a single model. Key contributions include a degradation-aware, VLM-guided transform selection mechanism, a multi-level hierarchical unfolding architecture, and a Transformer-based degradation module that preserves high-dimensional information across stages. Empirical results show superior performance over state-of-the-art one-by-one and all-in-one methods, with gains of 3.74 dB on SOTS dehazing and 1.70 dB on Rain100L deraining, indicating practical value for versatile, interpretable IR in real-world scenarios.

Abstract

Dynamic image degradations, including noise, blur and lighting inconsistencies, pose significant challenges in image restoration, often due to sensor limitations or adverse environmental conditions. Existing Deep Unfolding Networks (DUNs) offer stable restoration performance but require manual selection of degradation matrices for each degradation type, limiting their adaptability across diverse scenarios. To address this issue, we propose the Vision-Language-guided Unfolding Network (VLU-Net), a unified DUN framework for handling multiple degradation types simultaneously. VLU-Net leverages a Vision-Language Model (VLM) refined on degraded image-text pairs to align image features with degradation descriptions, selecting the appropriate transform for the target degradation. By integrating an automatic VLM-based gradient estimation strategy into the Proximal Gradient Descent (PGD) algorithm, VLU-Net effectively tackles complex multi-degradation restoration tasks while maintaining interpretability. Furthermore, we design a hierarchical feature unfolding structure to enhance the VLU-Net framework, efficiently synthesizing degradation patterns across various levels. VLU-Net is the first all-in-one DUN framework and outperforms current leading one-by-one and all-in-one end-to-end methods by 3.74 dB on the SOTS dehazing dataset and 1.70 dB on the Rain100L deraining dataset.
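
For readers unfamiliar with the PGD iteration that deep unfolding networks unroll, the following is a minimal textbook sketch, not the VLU-Net implementation. It assumes a known linear degradation y = A x and an L1 prior, so the proximal step is soft-thresholding. The names `pgd`, `soft_threshold`, `step`, and `lam` are illustrative choices that do not come from the paper. In VLU-Net, as the abstract describes, the hand-picked degradation matrix is replaced by a transform estimated with VLM guidance, and the iterations become learned network stages.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||x||_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def pgd(y, A, step, lam, num_iters=200):
    """Generic proximal gradient descent for
    min_x 0.5 * ||A x - y||^2 + lam * ||x||_1.
    A is an assumed, known degradation matrix (hypothetical setup)."""
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - y)               # gradient of the data-fidelity term
        z = x - step * grad                    # gradient descent step
        x = soft_threshold(z, step * lam)      # proximal step for the prior
    return x

# Toy usage: recover a sparse signal from noise-free linear measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
x_true = np.zeros(20)
x_true[[2, 7, 11]] = [1.0, -2.0, 1.5]
y = A @ x_true

step = 1.0 / np.linalg.norm(A, 2) ** 2         # 1 / Lipschitz constant of the gradient
x_hat = pgd(y, A, step, lam=0.01, num_iters=500)
```

Each loop iteration pairs one gradient step with one proximal step; a DUN typically maps each such pair to one network stage with learnable parameters.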

Paper Structure

This paper contains 26 sections, 13 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: VLU-Net integrates an automatic gradient estimation strategy based on VLMs, enhancing existing DUNs, which can handle only a single task. This allows multiple degradations to be handled simultaneously without manual transform selection. Its hierarchical multi-stage unfolding structure efficiently synthesizes degradation patterns across various levels and stages.
  • Figure 2: Overview of our VLU-Net, an all-in-one hierarchical DUN for multiple degradations, including the fine-tuning of CLIP and the primary IR process. The fine-tuning phase employs contrastive learning with degradation image-text pairs, while the IR process involves a $\mathbf{K}$-stage DUN with $l$ levels, projection and back projection. Each stage uses a Degradation-guided GDM (D-GDM) and a PMM to adaptively handle degraded inputs and maintain multi-level information across stages. The detailed PMM is presented in the supplementary materials.
  • Figure 3: Visual comparison with state-of-the-art under NHR settings.
  • Figure 4: Visual degradation from D-GDM for rain and haze.
  • Figure 5: Heat-maps for CLIP before (Left) and after (Right) fine-tuning. Top is for NHR setting and bottom is for NHRBL setting.
  • ...and 6 more figures
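
The caption of Figure 2 mentions contrastive learning on degradation image-text pairs. As a rough illustration only, the sketch below computes a generic CLIP-style symmetric contrastive loss in NumPy. It assumes each image in a batch is paired with exactly one distinct text description, and the function names and the `temperature` value are hypothetical. The paper's actual fine-tuning objective and batching may differ.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities.
    Row i of img_emb is assumed to match row i of txt_emb."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # pairwise similarity matrix
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage: perfectly aligned pairs versus deliberately mismatched pairs.
emb = np.eye(4)
loss_aligned = clip_contrastive_loss(emb, emb)
loss_mismatched = clip_contrastive_loss(emb, np.roll(emb, 1, axis=0))
```

The loss is small when matching image and text embeddings point in the same direction and large when they do not, which is the property that pulls degraded images toward their degradation descriptions during fine-tuning.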