Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration

Yuang Ai, Huaibo Huang, Xiaoqiang Zhou, Jiexiang Wang, Ran He

TL;DR

The paper tackles all-in-one image restoration (IR) under realistic, mixed degradations by introducing MPerceiver, a multimodal prompt learning framework that harnesses Stable Diffusion priors. It combines a textual branch (a CM-Adapter that maps CLIP image features to degradation-aware text prompts) with a visual branch (an IR-Adapter that supplies multiscale detail cues). The influence of each branch is weighted dynamically by degradation predictions, and a plug-in Detail Refinement Module boosts fidelity. The model is trained with latent-diffusion objectives and evaluated on 16 IR tasks, where it shows strong adaptiveness, generalizability, and fidelity, including in zero-shot and few-shot settings and on mixed real-world degradations.
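For readers unfamiliar with the phrase "latent-diffusion objectives", the generic noise-prediction loss used by latent diffusion models is sketched below. The notation is the standard one and is an assumption here, not taken from this page: $z_0$ is the clean VAE latent, $\bar{\alpha}_t$ the cumulative noise schedule, $\epsilon_\theta$ the denoising network, and $c$ the conditioning (presumably the multimodal prompts in this setting). The paper's own equations may differ in form or add further terms.

```latex
\begin{aligned}
z_t &= \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
      \qquad \epsilon \sim \mathcal{N}(0, I), \\
\mathcal{L}_{\mathrm{LDM}} &= \mathbb{E}_{z_0,\, c,\, \epsilon,\, t}
      \left[ \lVert \epsilon - \epsilon_\theta(z_t, t, c) \rVert_2^2 \right].
\end{aligned}
```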

Abstract

Despite substantial progress, all-in-one image restoration (IR) grapples with persistent challenges in handling intricate real-world degradations. This paper introduces MPerceiver: a novel multimodal prompt learning approach that harnesses Stable Diffusion (SD) priors to enhance adaptiveness, generalizability and fidelity for all-in-one image restoration. Specifically, we develop a dual-branch module to master two types of SD prompts: textual for holistic representation and visual for multiscale detail representation. Both prompts are dynamically adjusted by degradation predictions from the CLIP image encoder, enabling adaptive responses to diverse unknown degradations. Moreover, a plug-in detail refinement module improves restoration fidelity via direct encoder-to-decoder information transformation. To assess our method, MPerceiver is trained on 9 tasks for all-in-one IR and outperforms state-of-the-art task-specific methods across most tasks. After multitask pre-training, MPerceiver attains a generalized representation in low-level vision, exhibiting remarkable zero-shot and few-shot capabilities on unseen tasks. Extensive experiments on 16 IR tasks underscore the superiority of MPerceiver in terms of adaptiveness, generalizability and fidelity.
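The idea of prompts being "dynamically adjusted by degradation predictions" can be illustrated with a small sketch. It assumes the adjustment is a softmax-weighted sum over a pool of per-degradation prompt vectors. That mixing rule, the function names `softmax` and `mix_prompts`, and the toy three-entry pool are all hypothetical illustrations, not the paper's implementation.

```python
import math


def softmax(logits):
    """Turn raw degradation scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_prompts(degradation_logits, prompt_pool):
    """Blend per-degradation prompt vectors using predicted probabilities.

    Assumption: one prompt vector per degradation type, combined by a
    probability-weighted sum. The paper may use a different rule.
    """
    weights = softmax(degradation_logits)
    dim = len(prompt_pool[0])
    mixed = [0.0] * dim
    for weight, prompt in zip(weights, prompt_pool):
        for i, value in enumerate(prompt):
            mixed[i] += weight * value
    return weights, mixed


# Hypothetical pool: one 4-dimensional prompt each for rain, haze, and noise.
prompt_pool = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

# Scores suggesting a mix of rain and haze, with noise very unlikely.
logits = [2.0, 2.0, -5.0]
weights, mixed = mix_prompts(logits, prompt_pool)
```

With these scores, the rain and haze prompts receive equal weight and the noise prompt contributes almost nothing, which is the behavior one would want for an input with mixed degradations.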

Paper Structure (13 sections, 5 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Our MPerceiver excels in image restoration tasks with: (I) All-in-one: Addressing diverse degradations, including challenging mixed ones, through a single pretrained network. (II) Zero-shot: Handling degradations unseen during training. (III) Few-shot: Adapting to new tasks with minimal data (about 3%-5% of the data used by task-specific methods).
  • Figure 2: PSNR comparison with state-of-the-art all-in-one and task-specific methods across 10 tasks. Best viewed in color.
  • Figure 3: Illustration of MPerceiver's dual-branch module with multimodal prompts. Textual Branch: CLIP image embeddings are transformed into text vectors through cross-modal inversion, which are then used alongside textual prompts as holistic representations for SD. Visual Branch: IR-Adapter decomposes VAE image embeddings into multi-scale features, which are then dynamically modulated by visual prompts to provide detail guidance for SD adaptively.
  • Figure 4: Illustration of the detail refinement module (DRM). For simplicity, visual prompts and degradation predictions are omitted as input to the visual prompt (VP) modulator. Only the DRM is trainable; all other modules are frozen.
  • Figure 5: Effect of the DRM on raindrop removal (top row, from the Raindrop dataset) and motion deblurring (bottom row, from the GoPro dataset). The proposed DRM significantly improves the fidelity of the results.
  • ...and 3 more figures
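
Figure 2 compares methods by PSNR (peak signal-to-noise ratio), defined as 10 · log10(MAX² / MSE), where MSE is the mean squared error between the reference and restored images. A minimal sketch follows, assuming images are flattened into equal-length lists of pixel values with a peak value of 255. The function name `psnr` and that input format are illustrative choices, not the paper's evaluation code.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: a restoration closer to the reference scores a higher PSNR.
reference = [10, 50, 200, 255]
close = [12, 49, 198, 255]
far = [40, 20, 150, 200]
close_score = psnr(reference, close)
far_score = psnr(reference, far)
```

Higher is better. Published benchmarks often compute PSNR on a specific color channel or after cropping, so values from this sketch are not directly comparable to the paper's numbers.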