SeD: Semantic-Aware Discriminator for Image Super-Resolution

Bingchen Li, Xin Li, Hanxin Zhu, Yeying Jin, Ruoyu Feng, Zhizheng Zhang, Zhibo Chen

TL;DR

This work tackles the issue of coarse-grained distribution learning in SR discriminators by introducing a semantic-aware discriminator (SeD) that leverages pixel-wise semantics from pretrained vision models. A semantic-aware fusion block (SeFB) uses cross-attention to warp semantic cues into the discriminator, guiding the SR network to generate fine-grained, semantically consistent textures without increasing generator inference cost. Across classical and real-world SR benchmarks, SeD improves perceptual quality (e.g., LPIPS) while maintaining or boosting objective metrics, and ablations validate the efficacy of cross-attention fusion and CLIP RN50-based semantics. The approach is plug-and-play with existing GAN-based SR pipelines and demonstrates strong generalization on large-scale datasets, making semantic guidance a practical route to more realistic SR textures.

Abstract

Generative Adversarial Networks (GANs) have been widely used to recover vivid textures in image super-resolution (SR) tasks. In particular, a discriminator is used to enable the SR network to learn the distribution of real-world high-quality images through adversarial training. However, this distribution learning is overly coarse-grained, which makes it susceptible to virtual textures and leads to counter-intuitive generation results. To mitigate this, we propose the simple and effective Semantic-aware Discriminator (denoted as SeD), which encourages the SR network to learn fine-grained distributions by introducing the semantics of images as a condition. Concretely, we extract the semantics of images from a well-trained semantic extractor. Under different semantics, the discriminator is able to distinguish real from fake images individually and adaptively, which guides the SR network to learn more fine-grained semantic-aware textures. To obtain accurate and abundant semantics, we take full advantage of recently popular pretrained vision models (PVMs) trained on extensive datasets, and then incorporate their semantic features into the discriminator through a well-designed spatial cross-attention module. In this way, our proposed semantic-aware discriminator empowers the SR network to produce more photo-realistic and pleasing images. Extensive experiments on two typical tasks, i.e., SR and Real SR, have demonstrated the effectiveness of our proposed method.
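The spatial cross-attention fusion mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `semantic_cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and all tensor sizes are hypothetical. The sketch assumes that the semantic features supply the queries and the discriminator features supply the keys and values, and that both feature maps have already been brought to the same spatial resolution.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def semantic_cross_attention(disc_feat, sem_feat, w_q, w_k, w_v):
    """Fuse semantic features into discriminator features (illustrative only).

    disc_feat: (C_d, H, W) features from the discriminator.
    sem_feat:  (C_s, H, W) features from a frozen pretrained extractor.
    w_q: (C_s, d), w_k: (C_d, d), w_v: (C_d, d) hypothetical projections.
    Returns the fused map (d, H, W) and the attention matrix (H*W, H*W).
    """
    c_d, h, w = disc_feat.shape
    c_s = sem_feat.shape[0]

    # Flatten the spatial grid into H*W tokens, one per pixel location.
    img_tokens = disc_feat.reshape(c_d, h * w).T  # (N, C_d)
    sem_tokens = sem_feat.reshape(c_s, h * w).T   # (N, C_s)

    # Assumption: queries come from semantics, keys/values from image features.
    q = sem_tokens @ w_q  # (N, d)
    k = img_tokens @ w_k  # (N, d)
    v = img_tokens @ w_v  # (N, d)

    # Scaled dot-product attention over all spatial positions.
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)  # (N, N)
    fused = attn @ v                                         # (N, d)

    # Restore the (channels, H, W) layout.
    return fused.T.reshape(-1, h, w), attn


# Toy example with random tensors standing in for real feature maps.
rng = np.random.default_rng(0)
disc_feat = rng.standard_normal((8, 4, 4))
sem_feat = rng.standard_normal((16, 4, 4))
d = 8
w_q = rng.standard_normal((16, d)) * 0.1
w_k = rng.standard_normal((8, d)) * 0.1
w_v = rng.standard_normal((8, d)) * 0.1

fused, attn = semantic_cross_attention(disc_feat, sem_feat, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

Because the fusion sits entirely inside the discriminator, a design like this only affects training; the SR generator used at inference time is unchanged, which matches the TL;DR's claim of no added generator inference cost.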

Paper Structure (24 sections, 5 equations, 13 figures, 8 tables)

Figures (13)

  • Figure 1: Comparison between the vanilla discriminator and our proposed semantic-aware discriminator.
  • Figure 2: Illustration of (a) GAN-based SR with the vanilla discriminator. (b) Our proposed semantic-aware discriminator (SeD). (c) The network structure of SeFB. (d) The network structure of P+SeD. The vanilla discriminator measures the distributions of images regardless of their semantics, which causes the SR network to learn average textures (i.e., noise) or to generate textures unrelated to the semantics. In contrast, our proposed semantic-aware discriminator uses fine-grained semantics as the condition of the discriminator, which pushes the SR network to learn more fine-grained semantic-aware textures for SR.
  • Figure 3: Visual comparison (zoom-in for better view) to state-of-the-art GAN-based SR methods. We demonstrate patch-wise SeD here because it shows better subjective quality. With SeD, the SR network is capable of restoring photo-realistic textures.
  • Figure 4: Visual comparison (zoom-in for better view) to state-of-the-art GAN-based real-world SR methods. We demonstrate pixel-wise SeD here to align with previous works RealESRGAN and LDL.
  • Figure 5: The t-SNE visualization of discriminator features. The 7 categories are "fish", "plane", "barn", "train", "daisy", "dog", "monarch" from ImageNet, respectively.
  • ...and 8 more figures