Photo-Realistic Image Restoration in the Wild with Controlled Vision-Language Models
Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön
TL;DR
Restoring high-quality images from real-world degradations is difficult because the distortions are diverse and often out-of-distribution. This work couples a degradation-aware CLIP (DACLIP) with a synthetic high-order degradation pipeline and a restoration model based on a mean-reverting SDE (IR-SDE), and adds an optimal posterior sampling strategy that accelerates sampling and improves restoration quality. It extends DACLIP with robust training for multimodal conditioning, derives a tractable Gaussian posterior $p(x_{t-1}|x_t,x_0) = \mathcal{N}(x_{t-1}| \tilde{\mu}_t(x_t,x_0), \tilde{\beta}_t I)$ with closed-form mean $\tilde{\mu}_t$ and variance $\tilde{\beta}_t$, and validates the approach on wild IR benchmarks and the NTIRE RAIM challenge, showing superior perceptual quality and fidelity. The results indicate that photo-realistic restoration in the wild can be achieved without over-relying on pretrained priors while staying faithful to the input, which makes the approach practical for applications that require high-quality restoration of real-world images.
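The Gaussian posterior mentioned above can be illustrated with a minimal sketch. It assumes (this is not stated in the summary itself) that the forward IR-SDE is an Ornstein-Uhlenbeck process that reverts to the degraded image `mu` with stationary variance `lam**2`, so that the mean and variance follow from the standard product-of-Gaussians rule. The function names, the scalar-per-pixel treatment, and the arguments `theta_bar_prev` (accumulated drift up to step t-1) and `theta_step` (drift accumulated over the last step) are hypothetical and chosen for illustration only. In practice `x_0` would presumably be a network estimate of the clean image rather than the ground truth.

```python
import math
import random


def posterior_params(x_t, x_0, mu, theta_bar_prev, theta_step, lam):
    """Mean and variance of p(x_{t-1} | x_t, x_0) for a single scalar pixel.

    Assumes an Ornstein-Uhlenbeck forward process (an assumption of this sketch):
      x_{t-1} | x_0     ~ N(mu + a * (x_0 - mu),     lam^2 * (1 - a^2))
      x_t     | x_{t-1} ~ N(mu + b * (x_{t-1} - mu), lam^2 * (1 - b^2))
    Requires theta_step > 0 so that the denominator is non-zero.
    """
    a = math.exp(-theta_bar_prev)  # decay from time 0 to time t-1
    b = math.exp(-theta_step)      # decay from time t-1 to time t
    denom = 1.0 - (a * b) ** 2     # equals 1 - exp(-2 * theta_bar_t)
    mean = mu + (
        (1.0 - a ** 2) * b * (x_t - mu) + (1.0 - b ** 2) * a * (x_0 - mu)
    ) / denom
    var = lam ** 2 * (1.0 - a ** 2) * (1.0 - b ** 2) / denom
    return mean, var


def posterior_sample(x_t, x_0, mu, theta_bar_prev, theta_step, lam, rng=random):
    """Draw x_{t-1} = mean + sqrt(var) * eps with eps ~ N(0, 1)."""
    mean, var = posterior_params(x_t, x_0, mu, theta_bar_prev, theta_step, lam)
    return mean + math.sqrt(var) * rng.gauss(0.0, 1.0)
```

Under these assumptions the variance vanishes at the final step (`theta_bar_prev = 0`), so the last sample equals `x_0` exactly and carries no injected noise.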
Abstract
Though diffusion models have been successfully applied to various image restoration (IR) tasks, their performance is sensitive to the choice of training datasets. Typically, diffusion models trained on specific datasets fail to recover images that have out-of-distribution degradations. To address this problem, this work leverages a capable vision-language model and a synthetic degradation pipeline to learn image restoration in the wild (wild IR). More specifically, all low-quality images are simulated with a synthetic degradation pipeline that contains multiple common degradations such as blur, resize, noise, and JPEG compression. We then introduce robust training for a degradation-aware CLIP model to extract enriched image content features that assist high-quality image restoration. Our base diffusion model is the image restoration SDE (IR-SDE). Building upon it, we further present a posterior sampling strategy for fast, noise-free image generation. We evaluate our model on both synthetic and real-world degradation datasets. Moreover, experiments on the unified image restoration task illustrate that the proposed posterior sampling improves image generation quality for various degradations.
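The structure of such a degradation pipeline can be sketched as a randomized plan of operations. The sketch below only samples which degradations to apply and with what strength; it does not touch pixels. The operation order, the parameter names, the numeric ranges, and the reading of "high-order" as repeated passes over the same stage are all assumptions made for illustration, not values taken from the paper.

```python
import random

# Hypothetical stage definition: (operation, parameter name, (low, high)).
# Ranges are illustrative only and not taken from the paper.
STAGE = [
    ("blur", "sigma", (0.2, 3.0)),
    ("resize", "scale", (0.25, 1.0)),
    ("noise", "sigma", (1.0, 30.0)),
    ("jpeg", "quality", (30, 95)),
]


def sample_degradation_plan(order=2, seed=None):
    """Return a list of (operation, params) pairs for one training image.

    `order` is the number of times the blur -> resize -> noise -> jpeg stage
    is repeated (assumed meaning of "high-order" in this sketch).
    """
    rng = random.Random(seed)
    plan = []
    for _ in range(order):
        for name, param, (low, high) in STAGE:
            if isinstance(low, int):
                value = rng.randint(low, high)   # integer-valued, inclusive
            else:
                value = rng.uniform(low, high)   # real-valued strength
            plan.append((name, {param: value}))
    return plan
```

A call such as `sample_degradation_plan(order=2, seed=0)` yields eight steps; an actual pipeline would then apply each step to the high-quality image with an image library of choice to produce the paired low-quality input.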
