Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks

Yuhao Liu, Zhanghan Ke, Fang Liu, Nanxuan Zhao, Rynson W. H. Lau

TL;DR

Diff-Plugin addresses the challenge of preserving fine input details in diffusion-based low-level vision tasks by introducing a modular, plug-and-play framework. It adds Task-Plugins with dual branches (Task-Prompt Branch and Spatial Complement Branch) and a Plugin-Selector that routes natural-language prompts to the appropriate plugin, enabling text-driven, multi-task editing without retraining the base model. The approach demonstrates strong fidelity across eight tasks, showing state-of-the-art performance among diffusion-based methods and competitive results versus regression-based models, with robust training and schedulability across dataset sizes. Practically, this framework provides a flexible, scalable tool for reliable, detail-preserving edits in real-world scenarios, while also highlighting potential for region-specific guidance in future work.

Abstract

Diffusion models trained on large-scale datasets have achieved remarkable progress in image synthesis. However, due to the randomness in the diffusion process, they often struggle with diverse low-level tasks that require detail preservation. To overcome this limitation, we present a new Diff-Plugin framework to enable a single pre-trained diffusion model to generate high-fidelity results across a variety of low-level tasks. Specifically, we first propose a lightweight Task-Plugin module with a dual-branch design to provide task-specific priors, guiding the diffusion process in preserving image content. We then propose a Plugin-Selector that can automatically select different Task-Plugins based on the text instruction, allowing users to edit images by indicating multiple low-level tasks with natural language. We conduct extensive experiments on 8 low-level vision tasks. The results demonstrate the superiority of Diff-Plugin over existing methods, particularly in real-world scenarios. Our ablations further validate that Diff-Plugin is stable, schedulable, and supports robust training across different dataset sizes.
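
The routing idea behind the Plugin-Selector (mapping one text instruction to one or more Task-Plugins) can be illustrated with a small, self-contained sketch. This is not the paper's selector, whose design is not described in this abstract. It is a toy stand-in that scores a prompt against hypothetical keyword descriptors using bag-of-words cosine similarity and keeps every task above a threshold. The names `TASK_DESCRIPTORS`, `embed`, `cosine`, and `select_plugins`, the descriptor strings, and the threshold value are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical keyword descriptors, one per Task-Plugin (illustrative only;
# the real selector would use learned features, not hand-written keywords).
TASK_DESCRIPTORS = {
    "desnowing": "snow snowy desnow",
    "deraining": "rain rainy derain",
    "deblurring": "blur blurry deblur",
    "highlight_removal": "highlight glare specular",
}


def embed(text):
    """Toy bag-of-words vector: lowercase token -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_plugins(prompt, threshold=0.1):
    """Return every task whose descriptor is similar enough to the prompt."""
    query = embed(prompt)
    return [
        task
        for task, descriptor in TASK_DESCRIPTORS.items()
        if cosine(query, embed(descriptor)) > threshold
    ]


print(select_plugins("please remove the snow and the rain"))
print(select_plugins("draw a cat"))
```

With these toy descriptors, the first prompt selects `desnowing` and `deraining`, and the second selects nothing. Returning a list rather than a single best match mirrors the abstract's point that one instruction may name several low-level tasks.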

Paper Structure (11 sections, 7 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Real-world applications of Diff-Plugin across several single-type tasks and one multi-type low-level vision task. Diff-Plugin lets users select the low-level vision tasks they want to perform via natural language and generates high-fidelity results.
  • Figure 2: Stable Diffusion (SD; Rombach et al., 2022) results on four low-level vision tasks: desnowing, deblurring, deraining, and highlight removal. Each sub-figure illustrates a two-step process. First, we generate the left image using SD with a full-text description, where task-critical attributes are highlighted in red. Then, we remove unwanted attributes (indicated with strikethrough), optionally add new attributes (shown in orange), and apply the img2img function in SD, using the left image as a condition to generate the edited image on the right. We observe that while SD can grasp rich attributes of various low-level tasks and create content consistent with the descriptions, its inherent randomness often changes content during subsequent editing. For instance, in sub-figure (1), besides addressing the primary task-related degradation (e.g., snow), SD also alters unrelated content (e.g., the face profile).
  • Figure 3: Schematic illustration of the Diff-Plugin framework. Diff-Plugin identifies appropriate Task-Plugin $\mathcal{P}$ based on the user prompts, extracts task-specific priors, and then injects them into the pre-trained diffusion model to generate the user-desired results.
  • Figure 4: Schematic illustration of task-specific priors extraction via the proposed lightweight Task-Plugin. Task-Plugin processes three inputs: time step $t$, visual prompt from $\textit{Enc}_I(\cdot)$, and image content from $\textit{Enc}_V(\cdot)$. It distills visual guidance $\mathbf{F}^p$ via a task-prompt branch and extracts spatial features $\mathbf{F}^s$ through a spatial complement branch, jointly for task-specific priors.
  • Figure 5: Qualitative comparison. Diff-Plugin notably outperforms the regression-based method (3) and the diffusion-based methods (4)-(8). Magnified regions of several tasks are provided for clarity. Refer to the Supplemental for further comparisons.
  • ...and 2 more figures