Dual Prompting Image Restoration with Diffusion Transformers

Dehong Kong, Fan Li, Zhixin Wang, Jiaqi Xu, Renjing Pei, Wenbo Li, WenQi Ren

TL;DR

The paper tackles real-world image restoration by leveraging diffusion transformers (DiTs) and introducing DPIR, which uses two conditioning streams to extract and fuse information from low-quality inputs. DPIR comprises a degradation-robust VAE encoder for latent LQ conditioning and a dual prompting branch that combines textual prompts with global-local visual cues to guide restoration. The method achieves state-of-the-art results on synthetic and real degradations, outperforming GAN- and diffusion-based IR methods in both full-reference and no-reference metrics. The contributions include a lightweight LQ conditioning module, a global-local visual training strategy, and a dual prompting mechanism that leverages CLIP and T5 embeddings, demonstrating the effectiveness of conditioning-rich DiT-based IR for real-world scenarios with scalable training data.

Abstract

Recent state-of-the-art image restoration methods mostly adopt latent diffusion models with U-Net backbones, yet they still face challenges in achieving high-quality restoration due to their limited capabilities. Diffusion transformers (DiTs), like SD3, are emerging as a promising alternative because of their better quality and scalability. In this paper, we introduce DPIR (Dual Prompting Image Restoration), a novel image restoration method that effectively extracts conditional information from low-quality images from multiple perspectives. Specifically, DPIR consists of two branches: a low-quality image conditioning branch and a dual prompting control branch. The first branch utilizes a lightweight module to incorporate image priors into the DiT with high efficiency. More importantly, we believe that in image restoration, a textual description alone cannot fully capture an image's rich visual characteristics. Therefore, a dual prompting module is designed to provide the DiT with additional visual cues, capturing both global context and local appearance. The extracted global-local visual prompts serve as extra conditional control and, together with textual prompts, form dual prompts that greatly enhance the quality of the restoration. Extensive experimental results demonstrate that DPIR delivers superior image restoration performance.

Paper Structure

This paper contains 24 sections, 3 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: DPIR exhibits excellent restoration performance on real low-quality images. In the lower part, we compare the low-quality input with the results of a model using only textual prompts and of our model using visual-text dual prompts. The proposed dual prompting strategy consistently outperforms the text-only prompting variant in terms of image restoration quality and fidelity.
  • Figure 2: Framework of our proposed DPIR. Given a low-quality (LQ) image, a lightweight conditioning branch efficiently introduces the LQ information into the DiT backbone. Additionally, a dual prompting restoration branch extracts global and local visual information, alongside text prompts, to form visual-text dual prompts, which greatly enhances the restoration quality and fidelity.
  • Figure 3: The LQ conditioning branch has a lightweight feature extraction module and an adaptive feature alignment module.
  • Figure 4: The dual prompting control branch, including the generation of the dual embedding and the cls embedding.
  • Figure 5: Qualitative comparisons of different IR methods on DIV2K dataset. Our DPIR achieves the best visual performance.
  • ...and 4 more figures
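
As a rough illustration of the dual prompting idea described in the abstract and in Figures 2 and 4, the sketch below joins text-prompt tokens with one global and several local visual tokens into a single conditioning sequence. It is a minimal sketch under assumptions, not the authors' implementation: the function names (`project`, `build_dual_prompt`), the token dimensions, and the pad-or-truncate stand-in for a learned projection are all hypothetical, and the paper's actual fusion mechanism may differ.

```python
from typing import List

Vector = List[float]
Tokens = List[Vector]


def project(tokens: Tokens, out_dim: int) -> Tokens:
    """Hypothetical stand-in for a learned linear projection.

    Zero-pads or truncates every token to `out_dim` so that text and
    visual tokens share one width before they are concatenated.
    """
    return [(tok + [0.0] * out_dim)[:out_dim] for tok in tokens]


def build_dual_prompt(
    text_tokens: Tokens,
    global_token: Vector,
    local_tokens: Tokens,
    dim: int,
) -> Tokens:
    """Concatenate text and visual prompts along the sequence axis.

    Assumption: the dual prompt is the text token sequence followed by
    one global visual token and then the local visual tokens.
    """
    visual_tokens = [global_token] + local_tokens
    return project(text_tokens, dim) + project(visual_tokens, dim)


if __name__ == "__main__":
    # Toy inputs: 3 text tokens (width 4), 1 global and 2 local visual
    # tokens (width 6). All sizes are illustrative only.
    text = [[1.0] * 4 for _ in range(3)]
    global_vis = [2.0] * 6
    local_vis = [[3.0] * 6 for _ in range(2)]

    prompt = build_dual_prompt(text, global_vis, local_vis, dim=4)
    # 3 text + 1 global + 2 local = 6 tokens, each of width 4
    print(len(prompt), len(prompt[0]))
```

In a real model the projection would be a trained layer and the resulting sequence would condition the DiT; the point here is only that visual cues sit alongside the text prompt rather than replacing it.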