Improving Consistency in Diffusion Models for Image Super-Resolution

Junhao Gu, Peng-Tao Jiang, Hao Zhang, Mi Zhou, Jinwei Chen, Wenming Yang, Bo Li

TL;DR

This paper tackles two key problems in diffusion-based Real-ISR: semantic mismatch between text-driven priors and pixel-level reconstruction, and training-inference inconsistency stemming from DDPM's HQ-latent assumption. It introduces ConsisSR, combining a Hybrid Prompt Adapter (HPA) that fuses text and CLIP image embeddings for fine-grained semantic guidance with Time-Aware Latent Augmentation (TALA) that perturbs early timesteps to align training with inference. Empirical results on synthetic and real-world datasets show state-of-the-art performance among diffusion-based SR approaches, with ablations demonstrating the effectiveness of both HPA and TALA. The work offers a practical path to leveraging powerful T2I priors for Real-ISR, with code to enable broad adoption and further improvements.

Abstract

Recent methods exploit powerful text-to-image (T2I) diffusion models for real-world image super-resolution (Real-ISR) and achieve impressive results compared to previous models. However, we observe two kinds of inconsistency in diffusion-based methods that hinder existing models from fully exploiting diffusion priors. The first is the semantic inconsistency arising from diffusion guidance. T2I generation focuses on semantic-level consistency with text prompts, while Real-ISR emphasizes pixel-level reconstruction from low-quality (LQ) images, necessitating more detailed semantic guidance from LQ inputs. The second is the training-inference inconsistency stemming from the DDPM, which improperly assumes that the denoising input at every timestep is a high-quality (HQ) latent corrupted by Gaussian noise. To address these issues, we introduce ConsisSR to handle both the semantic and the training-inference inconsistencies. On the one hand, to address the semantic inconsistency, we propose a Hybrid Prompt Adapter (HPA). Instead of relying on text prompts with coarse-grained classification information, we leverage the more powerful CLIP image embeddings to explore additional color and texture guidance. On the other hand, we introduce Time-Aware Latent Augmentation (TALA) to bridge the training-inference inconsistency. Based on the probability function p(t), we accordingly enhance the SDSR training strategy. With the LQ latent corrupted by Gaussian noise as input, our TALA not only focuses on diffusion noise but also refines the LQ latent towards its HQ counterpart. Our method demonstrates state-of-the-art performance among existing diffusion models. The code will be made publicly available.
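To make the TALA idea concrete, the following is a minimal sketch of how a training input might be built when, with probability p(t), the HQ latent is swapped for the LQ latent before Gaussian noise is added. The abstract does not give the form of p(t), the noise schedule, or the training target, so everything specific here is an assumption: the linear beta schedule, the hypothetical `p_lq` (a probability that grows with the noise level), and the helper names `alpha_bar` and `tala_input`. Latents are plain Python lists for illustration only.

```python
import math
import random


def alpha_bar(t, T=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_s) for s = 1..t.

    Assumes a linear beta schedule; the schedule is not specified in the source.
    """
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (T - 1)
        prod *= 1.0 - beta
    return prod


def p_lq(t, T=1000):
    """Hypothetical p(t): chance of starting from the LQ latent.

    Assumed to grow linearly with the noise level t; the real p(t) may differ.
    """
    return t / T


def tala_input(z_hq, z_lq, t, rng, T=1000):
    """Build a noisy denoising input from either the HQ or the LQ latent."""
    use_lq = rng.random() < p_lq(t, T)
    source = z_lq if use_lq else z_hq
    a = alpha_bar(t, T)
    noise = [rng.gauss(0.0, 1.0) for _ in source]
    # Standard DDPM forward corruption applied to the chosen source latent.
    z_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(source, noise)]
    return z_t, noise, use_lq


rng = random.Random(0)
z_t, noise, use_lq = tala_input([1.0, 2.0], [0.5, 1.5], t=800, rng=rng)
print(use_lq, z_t)
```

Under this sketch, high-noise timesteps mostly see LQ-derived inputs, which is the situation the sampler faces at the start of inference, while low-noise timesteps mostly see the usual HQ-derived inputs. The supervision signal (noise prediction plus refinement towards the HQ latent) is deliberately left out because the abstract does not define it.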

Paper Structure

This paper contains 19 sections, 5 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparisons among existing semantic prompts for SDSR. Previous SDSR methods only apply the SD prior to text prompts, such as captioning, whereas handling image prompts requires retraining additional layers. Instead, we leverage the consistent CLIP embedding space for both text and image prompts, which efficiently provides more fine-grained semantic guidance.
  • Figure 2: Visualization of the truncated outputs reveals that smooth results naturally emerge in the early timesteps. This indicates that the denoising inputs gradually transition from LQ latent with Gaussian noise to that of HQ latent.
  • Figure 3: Overall training pipeline of our ConsisSR. We propose the TALA strategy, which accordingly substitutes HQ inputs with LQ ones in the early timesteps, thereby improving training-inference consistency. Additionally, we introduce the HPA module to leverage both CLIP's text and image embeddings to enhance semantic consistency, thereby producing more credible textures.
  • Figure 4: Venn diagram of CLIP's embedding space. The similarity is derived from the cosine distance after Softmax between image and various text descriptions.
  • Figure 5: Network architecture of our transformer block with the proposed Hybrid Prompt Adapter (HPA).
  • ...and 4 more figures
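
The similarity score mentioned in the Figure 4 caption can be illustrated with a small sketch. It is read here as a softmax over scaled cosine similarities between one image embedding and several text embeddings, which is how CLIP-style zero-shot scores are usually computed. That reading is an assumption, as are the toy two-dimensional embeddings, the `logit_scale` value, and the function names, none of which come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def image_text_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities (logit_scale is an assumed temperature)."""
    logits = [logit_scale * cosine(image_emb, t) for t in text_embs]
    return softmax(logits)


# Toy embeddings: the image is closest to the first text description.
image = [0.9, 0.1]
texts = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(image_text_probs(image, texts))
```

Because the softmax normalizes across the candidate descriptions, each score reflects how well a description matches relative to the others rather than in absolute terms.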