MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration

Yulin Ren, Xin Li, Bingchen Li, Xingrui Wang, Mengxi Guo, Shijie Zhao, Li Zhang, Zhibo Chen

TL;DR

MoE-DiffIR tackles the problem of universal compressed image restoration across diverse codecs by learning task-specific diffusion priors from Stable Diffusion. It introduces a Mixture-of-Experts Prompt module with a degradation-aware router and a Visual2Text adapter to leverage cross-modal priors, enabling robust texture restoration at low bitrates. A two-stage fine-tuning regime and a comprehensive CIR dataset benchmark across 21 degradations validate its effectiveness, showing superior perceptual quality (LPIPS/FID) and competitive fidelity (PSNR/SSIM) compared to state-of-the-art diffusion-based IR methods. The work demonstrates the practical potential of universal CIR with diffusion priors and cross-modal guidance, while noting limitations at extreme bitrates and suggesting avenues for further improvement.

Abstract

We present MoE-DiffIR, an innovative universal compressed image restoration (CIR) method with task-customized diffusion priors. This intends to handle two pivotal challenges in the existing CIR methods: (i) lacking adaptability and universality for different image codecs, e.g., JPEG and WebP; (ii) poor texture generation capability, particularly at low bitrates. Specifically, our MoE-DiffIR develops the powerful mixture-of-experts (MoE) prompt module, where some basic prompts cooperate to excavate the task-customized diffusion priors from Stable Diffusion (SD) for each compression task. Moreover, the degradation-aware routing mechanism is proposed to enable the flexible assignment of basic prompts. To activate and reuse the cross-modality generation prior of SD, we design the visual-to-text adapter for MoE-DiffIR, which aims to adapt the embedding of low-quality images from the visual domain to the textual domain as the textual guidance for SD, enabling more consistent and reasonable texture generation. We also construct one comprehensive benchmark dataset for universal CIR, covering 21 types of degradations from 7 popular traditional and learned codecs. Extensive experiments on universal CIR have demonstrated the excellent robustness and texture restoration capability of our proposed MoE-DiffIR. The project can be found at https://renyulin-f.github.io/MoE-DiffIR.github.io/.
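The routing idea described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the abstract does not specify how the router scores or combines basic prompts, so the top-k softmax gating used here is an assumption borrowed from common mixture-of-experts practice. The names `route_prompts`, `router_weights`, `basic_prompts`, and `top_k` are hypothetical, and the vectors are toy values rather than real CLIP or Stable Diffusion embeddings.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_prompts(
    degradation_prior: List[float],
    router_weights: List[List[float]],
    basic_prompts: List[List[float]],
    top_k: int = 2,
) -> Tuple[List[int], List[float]]:
    """Pick top_k basic prompts for a degradation prior and blend them.

    Assumed gating scheme (not taken from the paper): one linear logit per
    basic prompt, top-k selection, softmax over the selected logits only.
    """
    # One routing logit per basic prompt: dot product with the prior.
    logits = [
        sum(w * x for w, x in zip(row, degradation_prior))
        for row in router_weights
    ]
    # Indices of the top_k highest-scoring basic prompts.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Gate weights renormalized over the chosen prompts.
    gates = softmax([logits[i] for i in chosen])
    dim = len(basic_prompts[0])
    # Weighted sum of the chosen basic prompts, one dimension at a time.
    combined = [
        sum(g * basic_prompts[i][d] for g, i in zip(gates, chosen))
        for d in range(dim)
    ]
    return chosen, combined


if __name__ == "__main__":
    prior = [1.0, 0.0]                               # toy degradation prior
    weights = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # router for 3 basic prompts
    prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    chosen, combined = route_prompts(prior, weights, prompts, top_k=2)
    print("selected prompts:", chosen)
    print("combined prompt:", combined)
```

In this toy setup, different degradation priors select different subsets of the shared basic prompts, which is the behaviour the abstract attributes to the degradation-aware routing mechanism; how the combined prompt is injected into Stable Diffusion is outside the scope of the sketch.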

Paper Structure (26 sections, 4 equations, 11 figures, 11 tables)

Figures (11)

  • Figure 1: Visualization of restored compressed images with our MoE-DiffIR on various image codecs and coding modes. Our method can restore diverse compressed images at low bitrates through a single network while possessing high texture generation capability.
  • Figure 2: Comparison of different prompt interaction methods. Here we mainly categorize them into three types: (a) Single Prompt [ma2023prores], (c) Multiple Prompts [li2023prompt-PIP, luo2023controlling-DACLIP, ai2023multimodal-mperceiver], (b) MoE-Prompt (Ours). We use Mixture of Experts routing methods to select different combinations of prompts for various compression tasks. In (b), DP stands for Degradation Prior, which is obtained from LQ images through the pre-trained CLIP encoder of DACLIP.
  • Figure 3: The framework of the proposed MoE-DiffIR, which enables dynamic prompt learning for multiple CIR tasks through the (b) MoE-Prompt Generator and introduces a visual-to-text adapter to generate more reasonable textures. The (c) MoE-Prompt Module extracts multi-scale features to interact with (b). Panel (a) depicts the fine-tuning of Stable Diffusion, which consists of two stages. Stage I: only the MoE-Prompt Module is pre-trained to excavate task-customized diffusion priors for each CIR task. Stage II: the (d) Decoder Compensator is fine-tuned for structural correction.
  • Figure 4: Visual comparisons between our method and other state-of-the-art methods. The figure shows 5 different compression tasks: JPEG (QF=10), VVC (QP=47), HEVC (QP=47), $C_{SSIM}$ ("Low" bitrates), and $C_{PSNR}$ ("Low" bitrates). More visual results can be found in the Appendix.
  • Figure 5: Visual ablation results: different prompt interaction designs, use of V2T adapter and use of degradation prior (DP).
  • ...and 6 more figures