Photo-Realistic Image Restoration in the Wild with Controlled Vision-Language Models

Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön

TL;DR

Restoring high-quality (HQ) images from real-world degradations is difficult because the distortions are diverse and often out-of-distribution. The work couples a degradation-aware CLIP (DACLIP) with a synthetic high-order degradation pipeline and a mean-reverting SDE-based restoration model (IR-SDE), and adds an optimal posterior sampling strategy that accelerates and improves restoration. It extends DACLIP to robust multimodal conditioning, derives a tractable posterior $p(x_{t-1} \mid x_t, x_0) = \mathcal{N}(x_{t-1} \mid \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I)$ with explicit formulas for the mean and variance, and validates the approach on wild image restoration (wild IR) benchmarks and the NTIRE RAIM challenge, where it reports superior perceptual quality and fidelity. The results indicate that photo-realistic restoration in the wild can be achieved without over-reliance on pretrained priors while maintaining fidelity to the input, which makes the approach practical for applications that require high-quality restoration of real-world images.
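
To make the posterior concrete, here is a minimal scalar sketch. It assumes the mean-reverting process has Gaussian marginals $p(x_t \mid x_0) = \mathcal{N}(\mu + (x_0 - \mu)e^{-\bar{\theta}_t}, \lambda^2(1 - e^{-2\bar{\theta}_t}))$, as in the IR-SDE formulation. The symbols $\mu$ (the LQ image the process reverts to), $\lambda$, and $\bar{\theta}_t$ do not appear in this summary, and the function names and the `predict_x0` callable are hypothetical. The mean and variance below follow from standard Gaussian conditioning under that assumption; the paper's own formulas may be parameterized differently.

```python
import math
import random


def marginal(x0, mu, theta_bar_t, lam):
    """Mean and variance of p(x_t | x_0) under the assumed mean-reverting process."""
    decay = math.exp(-theta_bar_t)
    return mu + (x0 - mu) * decay, lam ** 2 * (1.0 - decay ** 2)


def posterior(x_t, x0, mu, theta_bar_prev, theta_bar_t, lam):
    """Mean and variance of p(x_{t-1} | x_t, x_0) obtained by Gaussian conditioning."""
    a = math.exp(-theta_bar_prev)                   # decay accumulated from step 0 to t-1
    b = math.exp(-(theta_bar_t - theta_bar_prev))   # decay over the single step t-1 -> t
    denom = 1.0 - (a * b) ** 2
    mean = mu + (b * (1 - a ** 2) * (x_t - mu) + a * (1 - b ** 2) * (x0 - mu)) / denom
    var = lam ** 2 * (1 - a ** 2) * (1 - b ** 2) / denom
    return mean, var


def restore(x_T, mu, theta_bar, lam, predict_x0, rng):
    """Reverse sampling: at each step, draw x_{t-1} from the posterior around a predicted x_0.

    theta_bar is a cumulative schedule with theta_bar[0] == 0.
    predict_x0(x, t) is a hypothetical stand-in for the trained network.
    """
    x = x_T
    for t in range(len(theta_bar) - 1, 0, -1):
        x0_hat = predict_x0(x, t)
        mean, var = posterior(x, x0_hat, mu, theta_bar[t - 1], theta_bar[t], lam)
        x = mean + math.sqrt(var) * rng.gauss(0.0, 1.0)
    return x


if __name__ == "__main__":
    schedule = [0.0, 0.3, 0.7, 1.2]   # illustrative values only
    print(restore(5.0, 2.0, schedule, 0.5, lambda x, t: -1.0, random.Random(0)))
```

Under these assumptions the posterior variance is zero at the last step ($\bar{\theta}_0 = 0$) and the mean reduces to the predicted $x_0$, so the final output carries no injected noise.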

Abstract

Though diffusion models have been successfully applied to various image restoration (IR) tasks, their performance is sensitive to the choice of training datasets. Typically, diffusion models trained on specific datasets fail to recover images that have out-of-distribution degradations. To address this problem, this work leverages a capable vision-language model and a synthetic degradation pipeline to learn image restoration in the wild (wild IR). More specifically, all low-quality images are simulated with a synthetic degradation pipeline that contains multiple common degradations such as blur, resize, noise, and JPEG compression. Then we introduce robust training for a degradation-aware CLIP model to extract enriched image content features to assist high-quality image restoration. Our base diffusion model is the image restoration SDE (IR-SDE). Built upon it, we further present a posterior sampling strategy for fast noise-free image generation. We evaluate our model on both synthetic and real-world degradation datasets. Moreover, experiments on the unified image restoration task illustrate that the proposed posterior sampling improves image generation quality for various degradations.

Paper Structure (22 sections, 8 equations, 8 figures, 4 tables)

This paper contains 22 sections, 8 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Examples of the synthetic LQ images generated using our proposed degradation pipeline and results produced by our method and other state-of-the-art wild IR approaches: Real-ESRGAN [wang2021real], StableSR [wang2023exploiting], and SUPIR [yu2024scaling]. Notably, both StableSR and SUPIR adapt pretrained Stable Diffusion [rombach2022high, podell2023sdxl] models to image restoration, and SUPIR further leverages textual semantic guidance using LLaVA [liu2024visual]. The proposed method successfully handles various complex degradations and produces clean and sharp results.
  • Figure 2: Overview of the proposed pipeline for synthetic image degradation. There are three degradation phases adopting the random shuffle strategy. We use different types of filters in blur generation and add the Wiener deconvolution for simulating ringing artifacts similar to the Sinc filter in Real-ESRGAN [wang2021real]. As a general $\times 1$ image restoration pipeline, we use one 'resize' operation to provide image resolution augmentation, and another resize operation to ensure that all the degraded images are resized back to their original size.
  • Figure 3: Examples of applying Wiener deconvolution to generate ringing artifacts. Compared to the Sinc filter used in Real-ESRGAN [wang2021real], the proposed Wiener deconvolution generates more distinct ringing artifacts on textures.
  • Figure 4: The proposed robust degradation-aware CLIP (DACLIP) model. $e^T_c$ and $e^T_d$ are caption and degradation text embeddings, respectively. The embeddings $(e^{LQ}_c, e^{LQ}_d)$ are extracted from LQ images, and $e^{HQ}_c$ represents the HQ image embedding extracted from the original CLIP image encoder.
  • Figure 5: Visual comparison of the proposed model with other state-of-the-art photo-realistic image restoration approaches on our synthetic DIV2K [agustsson2017ntire] dataset. Our method trains the diffusion model from scratch while other approaches leverage pretrained Stable Diffusion models. Note that all methods using Stable Diffusion are prone to generate unrecognizable text, such as for the white shirt in the second row.
  • ...and 3 more figures
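
The degradation pipeline described in the Figure 2 caption (shuffled phases, a resize for resolution augmentation, Wiener deconvolution for ringing, and a final resize back to the original size) might be sketched roughly as follows for a grayscale image in [0, 1]. Every function name, kernel, and parameter range here is an assumption chosen for illustration, as is the point at which Wiener deconvolution is applied. JPEG compression is left out because it needs an image codec, blur uses circular FFT convolution, and resizing is nearest-neighbour for brevity.

```python
import numpy as np


def gaussian_kernel(size=7, sigma=1.5):
    """Normalized 2-D Gaussian blur kernel (one of many possible blur filters)."""
    ax = np.arange(size) - size // 2
    k = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return k / k.sum()


def kernel_otf(kernel, shape):
    """Zero-pad the kernel to `shape`, centre it at the origin, and take its FFT."""
    padded = np.zeros(shape)
    kh, kw = kernel.shape
    padded[:kh, :kw] = kernel
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.fft.fft2(padded)


def blur(img, kernel):
    """Circular convolution via FFT (image borders wrap around)."""
    return np.real(np.fft.ifft2(np.fft.fft2(img) * kernel_otf(kernel, img.shape)))


def wiener_deconv(img, kernel, snr=100.0):
    """Wiener filter conj(H) / (|H|^2 + 1/snr); it tends to overshoot near edges (ringing)."""
    H = kernel_otf(kernel, img.shape)
    G = np.conj(H) / (np.abs(H) ** 2 + 1.0 / snr)
    return np.real(np.fft.ifft2(np.fft.fft2(img) * G))


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize using integer index maps."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows[:, None], cols[None, :]]


def degrade(img, rng, phases=3):
    """Apply `phases` rounds of randomly ordered degradations, then restore the size."""
    h, w = img.shape
    kernel = gaussian_kernel()
    ops = [
        lambda x: blur(x, kernel),
        # Resolution augmentation: random scale, never smaller than the kernel.
        lambda x: resize_nearest(
            x, *[max(8, int(s * rng.uniform(0.5, 1.5))) for s in x.shape]
        ),
        lambda x: x + rng.normal(0.0, 0.05, x.shape),   # additive Gaussian noise
    ]
    out = img
    for _ in range(phases):
        for i in rng.permutation(len(ops)):             # random shuffle per phase
            out = ops[i](out)
    out = wiener_deconv(out, kernel)                    # assumed placement of the ringing step
    out = resize_nearest(out, h, w)                     # back to the original size
    return np.clip(out, 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hq = rng.random((32, 32))
    lq = degrade(hq, rng)
    print(hq.shape, lq.shape)
```

Because the last resize restores the input resolution, each LQ image stays aligned with its HQ source, which is what a $\times 1$ restoration model needs for paired training.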