Image Super-Resolution with Text Prompt Diffusion

Zheng Chen, Yulun Zhang, Jinjin Gu, Xin Yuan, Linghe Kong, Guihai Chen, Xiaokang Yang

TL;DR

This work tackles single-image super-resolution under unknown and diverse degradations by injecting textual priors as degradation descriptors. It introduces a text–image generation pipeline that converts degradations into discrete text prompts and pairs them with HR–LR data, and presents PromptSR, a diffusion-based SR model conditioned on the upsampled LR image and on text prompts that are generated by an MLLM and embedded by a pre-trained text encoder. The approach yields state-of-the-art perceptual quality on synthetic and real-world datasets, demonstrating that flexible textual guidance can effectively model degradation and improve SR beyond traditional LR-only conditioning. The findings highlight a practical pathway for incorporating natural language priors into SR, with broader implications for multimodal restoration tasks and real-world deployment.

Abstract

Image super-resolution (SR) methods typically model degradation to improve reconstruction accuracy in complex and unknown degradation scenarios. However, extracting degradation information from low-resolution images is challenging, which limits model performance. To boost image SR performance, one feasible approach is to introduce additional priors. Inspired by advancements in multi-modal methods and text-prompt image processing, we introduce text prompts to image SR to provide degradation priors. Specifically, we first design a text-image generation pipeline to integrate text into the SR dataset through the text degradation representation and the degradation model. By adopting a discrete design, the text representation is flexible and user-friendly. Meanwhile, we propose PromptSR to realize text-prompt SR. PromptSR leverages the latest multi-modal large language model (MLLM) to generate prompts from low-resolution images. It also utilizes a pre-trained language model (e.g., T5 or CLIP) to enhance text comprehension. We train PromptSR on the text-image dataset. Extensive experiments indicate that introducing text prompts into SR yields impressive results on both synthetic and real-world images. Code: https://github.com/zhengchen1999/PromptSR.
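
The discrete text representation mentioned above can be illustrated with a minimal sketch. It maps continuous degradation strengths to the level names seen in the prompts quoted on this page ("light", "medium", "heavy") and joins them in the same five-step order. The bin edges, the parameter names (blur_sigma, noise_std, jpeg_quality), and the helper functions are hypothetical and chosen only for illustration; the thresholds actually used by the authors are not stated here.

```python
from bisect import bisect_right

# Level names as they appear in the prompts quoted on this page.
LEVELS = ("light", "medium", "heavy")

# Hypothetical bin edges (assumptions, not the authors' thresholds).
BIN_EDGES = {
    "blur": (1.0, 2.5),           # assumed: Gaussian kernel sigma
    "noise": (10.0, 25.0),        # assumed: noise standard deviation
    "compression": (40.0, 70.0),  # assumed: 100 - JPEG quality
}


def level(kind, value):
    """Map a continuous degradation strength to a discrete level name."""
    return LEVELS[bisect_right(BIN_EDGES[kind], value)]


def degradation_prompt(blur_sigma, noise_std, jpeg_quality):
    """Build a prompt in the order blur, upsample, noise, compression, downsample."""
    parts = [
        f"{level('blur', blur_sigma)} blur",
        "upsample",
        f"{level('noise', noise_std)} noise",
        f"{level('compression', 100 - jpeg_quality)} compression",
        "downsample",
    ]
    return "[" + ", ".join(parts) + "]"


print(degradation_prompt(blur_sigma=3.0, noise_std=15.0, jpeg_quality=50))
# prints: [heavy blur, upsample, medium noise, medium compression, downsample]
```

Because the prompt is a short list of discrete tokens rather than exact numeric parameters, a user could plausibly write or edit one by hand, which is the flexibility the abstract refers to.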

Paper Structure (19 sections, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Visual comparison ($\times$4). The LR image undergoes complex and unknown degradations (e.g., blur, noise, and downsampling). By introducing text prompts (e.g., [heavy blur, upsample, medium noise, medium compression, downsample], in the instance) into the SR task to provide degradation priors, the reconstruction quality can be effectively improved.
  • Figure 2: Illustration of the text-image generation pipeline. (a) The pipeline comprises the degradation model (top) and the text representation (bottom). The degradation model comprises five steps, where "Comp" denotes the compression. The text representation describes each degradation operation in a discretized manner, e.g., [medium noise] for noise operation. Except for the illustrated aligned prompt-degradation sequence, our pipeline supports more flexible degradation and prompt formats. (b) An example to display the dataset.
  • Figure 3: Visual comparison ($\times$4) of different text contents. Caption (description of the overall image): [people on a weathered balcony of a building with closed shutters]. Degradation: [light blur, upsample, light noise, heavy compression, downsample].
  • Figure 4: The overall architecture of the PromptSR. The text prompt $\mathbf{c}$ is generated from the LR image $\mathbf{x}$ using the multi-modal large language model (MLLM) and embedded by the pre-trained text encoder. The denoising network (DN) takes the bicubic-upsampled LR image (target HR image resolution), the noise image $\mathbf{y}_t$ ($t \in [1, T]$), and the text embedding as inputs to predict noise.
  • Figure 5: Visual results ($\times$4) of different prompts. [...] represents the prompt. Boxes in the figures highlight the differences in details.
  • ...and 1 more figure
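
The Figure 4 caption lists three inputs to the denoising network: the bicubic-upsampled LR image, the noisy image $\mathbf{y}_t$, and the text embedding. The sketch below shows how such inputs could be assembled for one training step. It assumes the standard DDPM forward process with a linear noise schedule, which this page does not confirm for PromptSR; the function names, the schedule values, and the toy flattened "images" are all hypothetical stand-ins.

```python
import math
import random


def linear_alpha_bar(T, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear schedule (assumed)."""
    alpha_bar, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / max(T - 1, 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar  # alpha_bar[t - 1] corresponds to step t in [1, T]


def add_noise(y0, t, alpha_bar, rng):
    """Standard DDPM forward step: y_t = sqrt(a) * y0 + sqrt(1 - a) * eps."""
    a = alpha_bar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in y0]
    y_t = [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(y0, eps)]
    return y_t, eps


def denoiser_inputs(lr_upsampled, y_t, text_embedding, t):
    """Bundle the three inputs named in the Figure 4 caption, plus the step t."""
    assert len(lr_upsampled) == len(y_t)  # both at the target HR resolution
    return {"image": [lr_upsampled, y_t],  # stacked channel-wise
            "text": text_embedding, "t": t}


rng = random.Random(0)
T = 1000
alpha_bar = linear_alpha_bar(T)
hr = [0.2, 0.5, 0.8, 0.1]        # toy flattened HR image
lr_up = [0.25, 0.45, 0.7, 0.15]  # stands in for a bicubic-upsampled LR image
y_t, eps = add_noise(hr, t=500, alpha_bar=alpha_bar, rng=rng)
inputs = denoiser_inputs(lr_up, y_t, text_embedding=[0.0] * 8, t=500)
# A real denoising network would be trained to predict eps from these inputs.
```

In this sketch the upsampled LR image and the noisy image share the HR resolution, which is what allows them to be stacked before being passed to the network, while the text embedding enters as a separate conditioning signal.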