Dual Prompting Image Restoration with Diffusion Transformers
Dehong Kong, Fan Li, Zhixin Wang, Jiaqi Xu, Renjing Pei, Wenbo Li, WenQi Ren
TL;DR
The paper tackles real-world image restoration (IR) with diffusion transformers (DiTs) and introduces DPIR, which uses two conditioning streams to extract and fuse information from low-quality (LQ) inputs. DPIR comprises a degradation-robust VAE encoder for latent LQ conditioning and a dual prompting branch that combines textual prompts with global-local visual cues to guide restoration. The method achieves state-of-the-art results on synthetic and real degradations, outperforming GAN- and diffusion-based IR methods on both full-reference and no-reference metrics. The contributions include a lightweight LQ conditioning module, a global-local visual training strategy, and a dual prompting mechanism that leverages CLIP and T5 embeddings; together they demonstrate the effectiveness of conditioning-rich DiT-based IR for real-world scenarios with scalable training data.
Abstract
Recent state-of-the-art image restoration methods mostly adopt latent diffusion models with U-Net backbones, yet they still struggle to achieve high-quality restoration owing to the limited capabilities of these models. Diffusion transformers (DiTs), such as SD3, are emerging as a promising alternative because of their higher quality and scalability. In this paper, we introduce DPIR (Dual Prompting Image Restoration), a novel image restoration method that effectively extracts conditional information from low-quality images from multiple perspectives. Specifically, DPIR consists of two branches: a low-quality image conditioning branch and a dual prompting control branch. The first branch uses a lightweight module to incorporate image priors into the DiT with high efficiency. More importantly, we believe that in image restoration, a textual description alone cannot fully capture the rich visual characteristics of an image. Therefore, a dual prompting module is designed to provide the DiT with additional visual cues that capture both global context and local appearance. The extracted global-local visual prompts serve as extra conditional control and, together with textual prompts, form dual prompts that greatly enhance restoration quality. Extensive experimental results demonstrate that DPIR delivers superior image restoration performance.
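
As an illustration of the dual prompting idea, the following minimal sketch shows one plausible way to form a dual prompt: text tokens and global-local visual tokens are each projected to a shared width and then concatenated along the token axis. The abstract does not specify the fusion mechanism, so the concatenation, the function names (`project`, `build_dual_prompt`), the token counts, and the dimensions are all assumptions made for illustration, and random values stand in for real text and visual embeddings.

```python
import random

def rand_matrix(rows, cols, rng):
    """Random matrix standing in for embeddings or projection weights."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

def project(tokens, weight):
    """Linear projection: (n, d_in) times (d_in, d_out) gives (n, d_out)."""
    d_in, d_out = len(weight), len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(d_in)) for j in range(d_out)]
        for tok in tokens
    ]

def build_dual_prompt(text_tokens, global_tokens, local_tokens, w_text, w_visual):
    """Hypothetical fusion: project both prompt types to a shared width,
    then concatenate along the token axis (an assumption, not the paper's
    confirmed design)."""
    text = project(text_tokens, w_text)
    # Global context and local appearance tokens share one visual projection.
    visual = project(global_tokens + local_tokens, w_visual)
    return text + visual

rng = random.Random(0)
D_TEXT, D_VIS, D_MODEL = 8, 6, 4               # illustrative widths only
text_tokens = rand_matrix(5, D_TEXT, rng)      # stand-in for text embeddings
global_tokens = rand_matrix(1, D_VIS, rng)     # one global-context token
local_tokens = rand_matrix(3, D_VIS, rng)      # three local-appearance tokens
w_text = rand_matrix(D_TEXT, D_MODEL, rng)
w_visual = rand_matrix(D_VIS, D_MODEL, rng)

prompt = build_dual_prompt(text_tokens, global_tokens, local_tokens, w_text, w_visual)
print(len(prompt), len(prompt[0]))  # 9 tokens, each of width 4
```

In such a design, the resulting token sequence would be passed to the DiT as its prompt condition in place of a text-only prompt; other fusion schemes, such as separate attention paths, would be equally consistent with the description above.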
