Your Pre-trained Diffusion Model Secretly Knows Restoration

Sudarshan Rajagopalan, Vishal M. Patel

Abstract

Pre-trained diffusion models have enabled significant advancements in All-in-One Restoration (AiOR), offering improved perceptual quality and generalization. However, diffusion-based restoration methods primarily rely on fine-tuning or ControlNet-style modules to leverage the pre-trained diffusion model's priors for AiOR. In this work, we show that these pre-trained diffusion models inherently possess restoration behavior, which can be unlocked by directly learning prompt embeddings at the output of the text encoder. Interestingly, this behavior is largely inaccessible through text prompts and text-token embedding optimization. Furthermore, we observe that naive prompt learning is unstable because the forward noising process using degraded images is misaligned with the reverse sampling trajectory. To resolve this, we train prompts within a diffusion bridge formulation that aligns training and inference dynamics, enforcing a coherent denoising path from noisy degraded states to clean images. Building on these insights, we apply our lightweight learned prompts to the pre-trained WAN video and FLUX image models, converting them into high-performing restoration models. Extensive experiments demonstrate that our approach achieves competitive performance and generalization across diverse degradations, while avoiding fine-tuning and restoration-specific control modules.
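The core recipe described above, keeping the pre-trained backbone frozen and optimizing only a conditioning embedding by gradient descent, can be sketched with a toy stand-in. Everything below is a hypothetical illustration (the diagonal linear map is a stand-in for a frozen FLUX/WAN denoiser, and the targets are synthetic), not the authors' implementation:

```python
import numpy as np

# Toy stand-in for a frozen pre-trained denoiser: it maps a latent and a
# prompt embedding to an output latent. In the paper's setting this would
# be a frozen diffusion backbone; here it is a fixed diagonal linear map
# (hypothetical, for illustration only).
D = 8                                    # latent / embedding dimension
W = np.diag(np.linspace(0.3, 0.6, D))    # frozen "backbone" weights

def frozen_denoiser(z, prompt_emb):
    # The backbone weights W are never updated; only the conditioning
    # embedding changes during training.
    return z + W @ prompt_emb

rng = np.random.default_rng(0)
z_deg = rng.standard_normal(D)           # degraded latent (fixed input)
z_clean = z_deg + W @ np.ones(D)         # synthetic clean target, reachable
                                         # through some prompt embedding

# Embedding-space prompt learning: plain gradient descent on the prompt
# embedding alone, with the backbone frozen.
emb = np.zeros(D)
lr = 0.5
for _ in range(200):
    pred = frozen_denoiser(z_deg, emb)
    grad = 2.0 * W.T @ (pred - z_clean)  # gradient of ||pred - z_clean||^2
    emb -= lr * grad

final_err = np.linalg.norm(frozen_denoiser(z_deg, emb) - z_clean)
```

In this toy quadratic setting the learned embedding drives the reconstruction error to (numerically) zero while `W` stays untouched, mirroring the paper's claim that the conditioning alone, rather than any weight update, can steer a frozen model's output.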

Paper Structure

This paper contains 14 sections, 10 equations, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Popular editing-based approaches such as SDEdit [sdedit] and Prompt-to-Prompt [p2p] with Null-text Inversion [nti] (P2P + NTI) work well for high-level editing but perform poorly for restoration tasks, in this case dehazing.
  • Figure 2: Text token-space prompting is ineffective for restoration: even with optimized token prompts (textual inversion/prompt tuning), the model tends to denoise without removing degradations, whereas embedding-space optimization enables restoration from the same noisy degraded input.
  • Figure 3: (a) We freeze the diffusion backbone and optimize only the conditioning: token-space prompt optimization fails, while embedding-space (text-encoder output) optimization elicits restoration. (b) Naive tuning yields states anchored at $z_{\text{deg}}$; DDBM [ddbm] is pinned at both endpoints; our desired/EBR-style bridge [gcb] starts from noisy degraded inputs and denoises monotonically as the content transitions toward $z_{\text{clean}}$. (c) Naive training sees a different state family than inference, causing trajectory misalignment. (d) Bridge-based training aligns train/test states; DDBM may under-correct early (low noise near $z_{\text{deg}}$), while the desired/EBR bridge enables stronger correction on an aligned path.
  • Figure 4: Qualitative comparisons of the pre-trained FLUX model using our learned prompts with state-of-the-art AiOR approaches. Our approach enables the pre-trained FLUX to achieve remarkable restoration performance. WBSnow denotes the snow subset of the WeatherBench [weatherbench] dataset.
  • Figure 5: Qualitative comparisons of the pre-trained WAN model using our learned prompts with state-of-the-art AiOR approaches. ViWS-Net and AverNet are video restoration approaches, while the others are proposed for image restoration. Our prompts elicit the strong restoration potential of the pre-trained WAN model. AAU: AAURainSnow [aau]; NTU: real test set of NTU-Rain [spac].
  • ...and 4 more figures