
SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution

Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, Lei Zhang

TL;DR

This work tackles Real-ISR by addressing semantic fidelity when generative diffusion priors are conditioned on degraded LR inputs. It introduces SeeSR, a two-stage framework that first trains a Degradation-Aware Prompt Extractor (DAPE) to produce soft prompts and hard tag prompts, aligned to HR semantics through a tag model. In the second stage, these prompts condition a pretrained diffusion model via a ControlNet-like mechanism and representation cross-attention, with LR latent embedding at inference to reduce artifacts. Experiments on synthetic and real-world benchmarks show SeeSR yields superior no-reference perceptual quality and explicit semantic restoration, outperforming prior diffusion-based and GAN-based Real-ISR methods. Limitations include potential tag errors under severe degradation and difficulties reconstructing small-scale text, suggesting directions for improved prompts and auxiliary masks.

Abstract

Owing to their powerful generative priors, pre-trained text-to-image (T2I) diffusion models have become increasingly popular for solving the real-world image super-resolution problem. However, because the input low-resolution (LR) images are heavily degraded, the destruction of local structures can lead to ambiguous image semantics. As a result, the content of the reproduced high-resolution image may contain semantic errors, deteriorating the super-resolution performance. To address this issue, we present a semantics-aware approach to better preserve the semantic fidelity of generative real-world image super-resolution. First, we train a degradation-aware prompt extractor, which can generate accurate soft and hard semantic prompts even under strong degradation. The hard semantic prompts refer to the image tags, aiming to enhance the local perception ability of the T2I model, while the soft semantic prompts complement the hard ones by providing additional representation information. These semantic prompts encourage the T2I model to generate detailed and semantically accurate results. Furthermore, during the inference process, we integrate the LR images into the initial sampling noise to mitigate the diffusion model's tendency to generate excessive random details. The experiments show that our method can reproduce more realistic image details and better preserve the semantics. The source code of our method can be found at https://github.com/cswry/SeeSR.
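
The idea of integrating the LR image into the initial sampling noise can be illustrated with a minimal sketch. It assumes the standard diffusion forward-noising rule is applied to the LR latent at the terminal timestep; the function name `lr_embedded_start`, the latent shape, and the value of `alpha_bar_T` are hypothetical choices for illustration and are not taken from the released code.

```python
import numpy as np

def lr_embedded_start(z_lr, alpha_bar_T, rng):
    """Mix the LR latent with Gaussian noise to form the sampling start point.

    Assumption: the mix follows the usual forward process
    z_T = sqrt(alpha_bar_T) * z_lr + sqrt(1 - alpha_bar_T) * eps.
    """
    noise = rng.standard_normal(z_lr.shape)
    return np.sqrt(alpha_bar_T) * z_lr + np.sqrt(1.0 - alpha_bar_T) * noise

rng = np.random.default_rng(0)
z_lr = rng.standard_normal((4, 64, 64))   # hypothetical latent of the LR image
# A small alpha_bar_T keeps the start close to pure noise while retaining
# a weak trace of the LR content.
z_T = lr_embedded_start(z_lr, alpha_bar_T=0.005, rng=rng)
```

With `alpha_bar_T = 0` the start point is pure Gaussian noise, which is the setting the abstract contrasts against; any positive value carries some LR information into the first denoising step.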

Paper Structure (12 sections, 2 equations, 4 figures, 4 tables)

This paper contains 12 sections, 2 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison of different prompt styles and their corresponding Real-ISR results with PASD [yang2023pixel]. (a) Input LR image. (b)-(d) show the classification-style, caption-style and tag-style prompts extracted from the LR image and the corresponding Real-ISR results. (e) Null prompt and its corresponding Real-ISR result. (f)-(h) show the classification-style, caption-style and tag-style prompts extracted from the HR image and their corresponding Real-ISR results. (i) HR image.
  • Figure 2: Overview of SeeSR. (a) In the first stage, we train a degradation-aware prompt extractor (DAPE), which is initialized from a tag model. DAPE is trained to align its encoding of the degraded LR image with the tag model's encoding of the corresponding HR image (e.g., RAM [2023ram] in our work), which makes DAPE degradation-aware. (b) In the second stage, the well-trained DAPE provides both soft prompts (representation embedding) and hard prompts (tagging text), which are combined with the LR image to control a pretrained T2I model (e.g., SD [rombach2022high] in our work). The detailed structure of the controlled T2I diffusion model is shown in (c).
  • Figure 3: Effectiveness of the LR embedding (LRE) strategy in alleviating the discrepancy between training and inference of SD-based Real-ISR methods [wang2023exploiting, yang2023pixel, lin2023diffbir]. Top row: results without using LRE. Bottom row: results with LRE. We see that many falsely generated details in the sky area are removed.
  • Figure 4: Qualitative comparisons of different Real-ISR methods. Please zoom in for a better view.
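
The first-stage alignment and the two prompt types described in the Figure 2 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the loss form (mean squared error on representation embeddings plus a binary cross-entropy on tag logits), the weight `lam`, the 0.5 threshold, and every function and variable name are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dape_alignment_loss(rep_lr, logits_lr, rep_hr, logits_hr, lam=1.0):
    """Pull the LR-branch outputs toward the frozen HR-branch outputs.

    Assumed form: MSE on the representation embeddings (soft prompt)
    plus lam * binary cross-entropy on tag probabilities (hard prompt),
    with the HR-branch probabilities used as soft targets.
    """
    rep_loss = np.mean((rep_lr - rep_hr) ** 2)
    p_hr = sigmoid(logits_hr)
    p_lr = np.clip(sigmoid(logits_lr), 1e-7, 1.0 - 1e-7)  # avoid log(0)
    tag_loss = -np.mean(p_hr * np.log(p_lr) + (1.0 - p_hr) * np.log(1.0 - p_lr))
    return rep_loss + lam * tag_loss

def hard_prompt(logits, tag_names, threshold=0.5):
    """Turn tag logits into a comma-separated text prompt (assumed thresholding)."""
    probs = sigmoid(np.asarray(logits))
    return ", ".join(name for name, p in zip(tag_names, probs) if p > threshold)
```

In this sketch the representation embedding would be passed on as the soft prompt, while the string returned by `hard_prompt` would be fed to the text encoder of the T2I model as the hard prompt.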