Improving Consistency in Diffusion Models for Image Super-Resolution
Junhao Gu, Peng-Tao Jiang, Hao Zhang, Mi Zhou, Jinwei Chen, Wenming Yang, Bo Li
TL;DR
This paper tackles two key problems in diffusion-based Real-ISR: semantic mismatch between text-driven priors and pixel-level reconstruction, and training-inference inconsistency stemming from DDPM's HQ-latent assumption. It introduces ConsisSR, which combines a Hybrid Prompt Adapter (HPA) that fuses text and CLIP image embeddings for fine-grained semantic guidance with Time-Aware Latent Augmentation (TALA), which trains on noised LQ latents according to a time-dependent probability p(t) to align training with inference. Empirical results on synthetic and real-world datasets show state-of-the-art performance among diffusion-based SR approaches, with ablations demonstrating the effectiveness of both HPA and TALA. The work offers a practical path to leveraging powerful T2I priors for Real-ISR, and the authors state that the code will be made publicly available.
Abstract
Recent methods exploit powerful text-to-image (T2I) diffusion models for real-world image super-resolution (Real-ISR) and achieve impressive results compared to previous models. However, we observe two kinds of inconsistency in diffusion-based methods that hinder existing models from fully exploiting diffusion priors. The first is the semantic inconsistency arising from diffusion guidance. T2I generation focuses on semantic-level consistency with text prompts, while Real-ISR emphasizes pixel-level reconstruction from low-quality (LQ) images, necessitating more detailed semantic guidance from LQ inputs. The second is the training-inference inconsistency stemming from the DDPM, which improperly assumes that high-quality (HQ) latents corrupted by Gaussian noise are the denoising inputs at every timestep. To address these issues, we introduce ConsisSR, which handles both the semantic and the training-inference inconsistencies. On the one hand, to address the semantic inconsistency, we propose a Hybrid Prompt Adapter (HPA). Instead of relying on text prompts that carry only coarse-grained classification information, we leverage the more powerful CLIP image embeddings to explore additional color and texture guidance. On the other hand, we introduce Time-Aware Latent Augmentation (TALA) to bridge the training-inference inconsistency. Based on the probability function p(t), we accordingly enhance the SDSR training strategy. Taking LQ latents perturbed with Gaussian noise as inputs, TALA not only focuses on diffusion noise but also refines the LQ latent towards its HQ counterpart. Our method demonstrates state-of-the-art performance among existing diffusion models. The code will be made publicly available.
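To make the TALA idea concrete, the following is a minimal sketch of how a training input might be built, not the authors' implementation. It assumes the standard DDPM forward process with a linear beta schedule, and a hypothetical linear form p(t) = t/T for the probability of starting from the LQ latent; the abstract does not specify either. The names `alpha_bar`, `p_lq`, and `make_training_input` are illustrative only. The sketch covers input construction only; the denoiser would presumably still be supervised toward the HQ latent.

```python
import math
import random

T = 1000  # number of diffusion timesteps (assumed, not from the abstract)


def alpha_bar(t, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s), s = 1..t, for a linear beta schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (T - 1)
        prod *= 1.0 - beta
    return prod


def p_lq(t):
    """Hypothetical p(t): probability of using the LQ latent as the base at step t."""
    return t / T


def make_training_input(z_hq, z_lq, t, rng):
    """Return (z_t, used_lq): the noised latent at step t and which base was used.

    z_hq and z_lq are flat lists of floats standing in for latent tensors.
    """
    used_lq = rng.random() < p_lq(t)
    base = z_lq if used_lq else z_hq
    a = alpha_bar(t)
    # DDPM forward process: z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps
    z_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in base]
    return z_t, used_lq


rng = random.Random(0)
z_t, used_lq = make_training_input([1.0, 2.0], [0.5, 1.5], t=T, rng=rng)
```

Under this assumed p(t), large timesteps almost always start from the noised LQ latent, which mirrors inference, where sampling begins from the LQ input rather than from an HQ latent.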
