LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling

Huaqiu Li, Yong Wang, Tongwen Huang, Hailang Huang, Haoqian Wang, Xiangxiang Chu

TL;DR

This work proposes a novel, dataset-free, and unified image restoration approach based on recurrent posterior sampling with a pretrained latent diffusion model. A multimodal understanding model supplies semantic priors to the generative model under a task-blind condition.

Abstract

Unified image restoration is a highly challenging task in low-level vision. Existing methods either rely on designs tailored to specific tasks, which limits their generalizability across degradation types, or rely on training with paired datasets and therefore suffer from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach based on recurrent posterior sampling with a pretrained latent diffusion model. Our method incorporates a multimodal understanding model to provide semantic priors for the generative model under a task-blind condition. Furthermore, it uses a lightweight module to align the degraded input with the generative preference of the diffusion model, and employs recurrent refinement for posterior sampling. Extensive experiments demonstrate that our method outperforms state-of-the-art methods, validating its effectiveness and robustness. Our code and data are available at https://github.com/AMAP-ML/LD-RPS.

Paper Structure

This paper contains 13 sections, 11 equations, 9 figures, 6 tables, and 1 algorithm.

Figures (9)

  • Figure 1: LD-RPS has the capability to achieve high-quality zero-shot blind restoration in multiple tasks. Leveraging auxiliary text (keywords highlighted in blue) that describes image content or semantic information, our method achieves superior results in single degradation tasks, including image dehazing, denoising, and colorization, as well as in mixed degradation tasks, including low-light enhancement with denoising and image colorization with denoising.
  • Figure 2: Comparison between traditional diffusion posterior sampling methods for solving inverse problems and our recurrent posterior sampling approach based on latent diffusion.
  • Figure 3: The overall framework of LD-RPS. Initially, LD-RPS utilizes MLLMs to annotate the low-quality image and generate prompts. Based on these prompts, two distinct text-to-image processes are carried out: free diffusion and posterior sampling. In step 1, intermediate data produced by the diffusion process are employed to train and infer F-PAM, aligning the diffusion feature domain with the degraded image domain. In step 2, distance loss and quality loss are computed using the output of F-PAM and the intermediate diffusion results, with gradients propagated back. The entire diffusion process is recurrently conducted in a bootstrap manner to enhance generation quality. In the figure, $R$, $G$, $B$, and $M$ represent the three image channels and their mean, respectively.
  • Figure 4: Qualitative comparison results on the LOL dataset are visualized, with details highlighted in blue boxes for closer observation.
  • Figure 5: Qualitative comparison results on the HSTS subset of the RESIDE dataset are visualized.
  • ...and 4 more figures
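
The control flow described in the Figure 3 caption (fit an alignment module, take a guided step using a distance loss, and repeat the whole process recurrently) can be illustrated with a toy loop. The following is a minimal sketch under strong assumptions, not the authors' implementation: the signal is a 1-D array, the alignment module is replaced by a hypothetical least-squares affine fit, the reverse-diffusion step is replaced by a moving average, the quality loss is omitted, and the re-noising between rounds is a guess at what "bootstrap" could look like. All function and parameter names are invented for illustration.

```python
import numpy as np


def fit_affine_alignment(x, y):
    """Step 1 (toy): least-squares fit of y ~ a * x + b.

    Hypothetical stand-in for a learned alignment module such as F-PAM.
    """
    design = np.stack([x, np.ones_like(x)], axis=1)
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coeffs[0]), float(coeffs[1])


def distance_grad(x, y, a, b):
    """Step 2 (toy): gradient of 0.5 * ||a * x + b - y||^2 w.r.t. x,
    with the alignment parameters (a, b) held fixed."""
    return a * (a * x + b - y)


def smooth_prior_step(x, t):
    """Toy stand-in for one reverse-diffusion step: a 3-tap moving average.

    The timestep t is accepted for interface parity but unused here.
    """
    padded = np.pad(x, 1, mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0


def recurrent_posterior_sampling(y, prior_step, x_init, rounds=3, steps=20,
                                 lr=0.1, restart_noise=0.0, seed=0):
    """Run `rounds` passes of a guided sampling loop over the observation y."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    x = np.asarray(x_init, dtype=float).copy()
    for r in range(rounds):
        if r > 0:
            # Assumed bootstrap restart: re-noise the previous result
            # before running the next pass.
            x = x + restart_noise * rng.standard_normal(x.shape)
        for t in reversed(range(steps)):
            x = prior_step(x, t)                    # prior (generative) step
            a, b = fit_affine_alignment(x, y)       # step 1: fit alignment
            x = x - lr * distance_grad(x, y, a, b)  # step 2: guidance step
    return x
```

A call such as `recurrent_posterior_sampling(y, smooth_prior_step, x_init=np.random.default_rng(0).standard_normal(y.shape))` runs the loop from a noise initialization. Because the toy prior does not constrain scale or content the way a pretrained diffusion model would, this sketch only shows the ordering of the steps and should not be expected to restore an image.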