SeD: Semantic-Aware Discriminator for Image Super-Resolution
Bingchen Li, Xin Li, Hanxin Zhu, Yeying Jin, Ruoyu Feng, Zhizheng Zhang, Zhibo Chen
TL;DR
This work tackles the issue of coarse-grained distribution learning in SR discriminators by introducing a semantic-aware discriminator (SeD) that leverages pixel-wise semantics from pretrained vision models. A semantic-aware fusion block (SeFB) uses cross-attention to warp semantic cues into the discriminator, guiding the SR network to generate fine-grained, semantically consistent textures without increasing generator inference cost. Across classical and real-world SR benchmarks, SeD improves perceptual quality (e.g., LPIPS) while maintaining or boosting objective metrics, and ablations validate the efficacy of cross-attention fusion and CLIP RN50-based semantics. The approach is plug-and-play with existing GAN-based SR pipelines and demonstrates strong generalization on large-scale datasets, making semantic guidance a practical route to more realistic SR textures.
Abstract
Generative Adversarial Networks (GANs) have been widely used to recover vivid textures in image super-resolution (SR) tasks. In particular, one discriminator is utilized to enable the SR network to learn the distribution of real-world high-quality images in an adversarial training manner. However, the distribution learning is overly coarse-grained, which is susceptible to virtual textures and causes counter-intuitive generation results. To mitigate this, we propose the simple and effective Semantic-aware Discriminator (denoted as SeD), which encourages the SR network to learn the fine-grained distributions by introducing the semantics of images as a condition. Concretely, we aim to excavate the semantics of images from a well-trained semantic extractor. Under different semantics, the discriminator is able to distinguish the real-fake images individually and adaptively, which guides the SR network to learn the more fine-grained semantic-aware textures. To obtain accurate and abundant semantics, we take full advantage of recently popular pretrained vision models (PVMs) with extensive datasets, and then incorporate its semantic features into the discriminator through a well-designed spatial cross-attention module. In this way, our proposed semantic-aware discriminator empowered the SR network to produce more photo-realistic and pleasing images. Extensive experiments on two typical tasks, i.e., SR and Real SR have demonstrated the effectiveness of our proposed methods.
