
Boosting Image Restoration via Priors from Pre-trained Models

Xiaogang Xu, Shu Kong, Tao Hu, Zhe Liu, Hujun Bao

TL;DR

Extensive experiments demonstrate that PTG-RM, with its compact size (<1M parameters), effectively enhances restoration performance of various models across different tasks, including low-light enhancement, deraining, deblurring, and denoising.

Abstract

Pre-trained models trained on large-scale data, such as CLIP and Stable Diffusion, have demonstrated remarkable performance in high-level computer vision tasks such as image understanding and generation from language descriptions. Yet their potential for low-level tasks such as image restoration remains relatively unexplored. In this paper, we explore such models to enhance image restoration. Because off-the-shelf features (OSF) from pre-trained models do not directly serve image restoration, we propose to learn an additional lightweight module, the Pre-Train-Guided Refinement Module (PTG-RM), which refines the restoration results of a target restoration network using OSF. PTG-RM consists of two components: Pre-Train-Guided Spatial-Varying Enhancement (PTG-SVE) and Pre-Train-Guided Channel-Spatial Attention (PTG-CSA). PTG-SVE adaptively combines short- and long-range neural operations with spatially varying weights, while PTG-CSA strengthens channel and spatial attention for restoration-related learning. Extensive experiments demonstrate that PTG-RM, with its compact size (<1M parameters), effectively enhances the restoration performance of various models across different tasks, including low-light enhancement, deraining, deblurring, and denoising.
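
To give a sense of scale for the "<1M parameters" budget, the sketch below counts the parameters of a hypothetical refinement stack built from 1x1 and 3x3 convolutions. The layer list, the channel width `c`, and the OSF width `c_osf=512` are illustrative assumptions, not the paper's configuration; only the standard convolution parameter formula (k*k*c_in*c_out weights plus c_out biases) is general.

```python
def conv_params(k: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Parameter count of a k x k 2-D convolution: k*k*c_in*c_out weights plus c_out biases."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def hypothetical_module_params(c: int = 64, c_osf: int = 512) -> int:
    """Total parameters of an illustrative refinement stack (NOT the paper's architecture)."""
    layers = [
        (1, c_osf, c),  # project assumed OSF channels down to width c
        (3, c, c),      # short-range branch
        (3, c, c),      # second 3x3 branch
        (1, 2 * c, 1),  # single-channel spatial weight map from two concatenated features
        (1, c, c),      # channel-attention projection
        (3, c, c),      # fusion layer
    ]
    return sum(conv_params(k, c_in, c_out) for k, c_in, c_out in layers)


# Compare the budget at several channel widths.
for width in (32, 64, 128, 256):
    total = hypothetical_module_params(width)
    print(f"width={width:4d}  params={total:,}  under 1M: {total < 1_000_000}")
```

In this hypothetical stack, width 64 gives 147,905 parameters and width 128 gives 525,185, while width 256 exceeds 1M: 3x3 convolution weights grow quadratically with channel width, so staying under 1M mostly constrains the width of the 3x3 layers.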

Paper Structure

This paper contains 12 sections, 8 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Our method leverages pre-trained models, such as CLIP and Stable Diffusion, to significantly improve image restoration across various tasks. More results on different tasks and models are given in the experiments. The pre-trained models are used only during training and are not required at inference.
  • Figure 2: We present a lightweight plugin, the Pre-Train-Guided Refinement Module (PTG-RM), that leverages pre-trained models to enhance image restoration. The desired prior is the OSF $\mathcal{G}(I_d)$. PTG-RM has two components, PTG spatial-varying enhancement (PTG-SVE) and PTG channel-spatial attention (PTG-CSA); Fig. 3 depicts their details. PTG-RM significantly improves restoration in the various tasks listed at the top right (quantitative results are previewed in Fig. 1).
  • Figure 3: The pipeline of PTG-SVE and PTG-CSA. In PTG-SVE, the learnable spatial embedding $\mathcal{S}_m$, the OSF $g$, and the input feature $f$ adaptively form spatial weights $M$ that fuse the short- and long-range processed features $f_s$ and $f_l$ (produced by operations $\mathcal{R}_s$ and $\mathcal{R}_l$), yielding $\hat{f}$. In PTG-CSA, the OSF $g$ conditions the channel attention $M_c$ for $\hat{f}$ through $\mathcal{R}_c$. In addition, $g$ is combined with the learnable spatial representation $\mathcal{S}_c$ and $\hat{f}$ to generate the spatial attention map $M_s$, and spatial-wise convolutions $C_p$ (obtained via $\mathcal{R}_p$) are used to derive $\hat{\mathcal{M}}_s$, which is further processed with $\mathcal{R}_o$. The channel- and spatial-attention outputs ($\hat{f}_c$ and $\hat{f}_s$) are merged via $\mathcal{R}_f$ into the enhanced feature $\bar{f}$.
  • Figure 4: Comparisons on LOL-real (top) and SID (bottom). Results labeled "Ours" show less noise and clearer visibility.
  • Figure 5: Visual comparison on Rain100H showing the effects of our strategy.
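
The data flow in the Figure 3 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: features are (C, H, W) arrays; the OSF $g$ is assumed already resized to the same H x W; the short-range $\mathcal{R}_s$ is stood in for by a 3x3 mean filter and the long-range $\mathcal{R}_l$ by global average context; the fusion $\hat{f} = M \odot f_s + (1 - M) \odot f_l$ is an assumed form of "spatial weights for fusing"; every learned mapping is a single 1x1 projection with random weights; and the $C_p$ / $\mathcal{R}_o$ steps of PTG-CSA are omitted. The helper names (`ptg_sve`, `ptg_csa`, `box3x3`, `conv1x1`) are hypothetical.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def conv1x1(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """1x1 convolution: x is (C_in, H, W), w is (C_out, C_in); returns (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", w, x)


def box3x3(x: np.ndarray) -> np.ndarray:
    """Per-channel 3x3 mean filter with edge padding (assumed stand-in for R_s)."""
    h, w = x.shape[1:]
    p = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode="edge")
    return sum(p[:, i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0


def ptg_sve(f, g, s_m, w_m):
    """Hypothetical PTG-SVE-style fusion; returns (f_hat, M)."""
    f_s = box3x3(f)                            # short-range branch (assumed R_s)
    f_l = f.mean(axis=(1, 2), keepdims=True)   # long-range branch: global context (assumed R_l)
    # Spatial weights M in [0, 1] from the embedding S_m, the OSF g, and the feature f.
    m = sigmoid(conv1x1(np.concatenate([s_m, g, f]), w_m))  # (1, H, W)
    f_hat = m * f_s + (1.0 - m) * f_l          # per-pixel convex mix of both branches
    return f_hat, m


def ptg_csa(f_hat, g, s_c, w1, w2, w_s, w_f):
    """Hypothetical PTG-CSA-style attention; C_p and R_o are omitted."""
    # Channel attention M_c conditioned on globally pooled OSF (stand-in for R_c).
    m_c = sigmoid(w2 @ np.maximum(w1 @ g.mean(axis=(1, 2)), 0.0))  # (C,)
    f_c = m_c[:, None, None] * f_hat
    # Spatial attention M_s from the OSF, the learnable map S_c, and f_hat.
    m_s = sigmoid(conv1x1(np.concatenate([g, s_c, f_hat]), w_s))  # (1, H, W)
    f_sp = m_s * f_hat
    # Merge both attention outputs with a 1x1 projection (stand-in for R_f).
    return conv1x1(np.concatenate([f_c, f_sp]), w_f)  # f_bar, (C, H, W)


# Demo with random, untrained weights: only shapes and data flow are meaningful.
rng = np.random.default_rng(0)
C, C_g, H, W = 8, 16, 12, 12
f = rng.standard_normal((C, H, W))      # input feature
g = rng.standard_normal((C_g, H, W))    # OSF, assumed already resized to H x W
s_m = rng.standard_normal((1, H, W))    # learnable spatial embedding S_m
s_c = rng.standard_normal((1, H, W))    # learnable spatial representation S_c
f_hat, m = ptg_sve(f, g, s_m, 0.1 * rng.standard_normal((1, 1 + C_g + C)))
f_bar = ptg_csa(
    f_hat, g, s_c,
    w1=0.1 * rng.standard_normal((4, C_g)),
    w2=0.1 * rng.standard_normal((C, 4)),
    w_s=0.1 * rng.standard_normal((1, C_g + 1 + C)),
    w_f=0.1 * rng.standard_normal((C, 2 * C)),
)
print(f_hat.shape, m.shape, f_bar.shape)  # (8, 12, 12) (1, 12, 12) (8, 12, 12)
```

Because $M$ lies in [0, 1] under this assumed fusion, each pixel of $\hat{f}$ sits between its short-range and long-range values, so different regions can lean on local detail or on global context independently. In a trained module the projections would be learned rather than random.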