Auto-Encoded Supervision for Perceptual Image Super-Resolution

MinKyu Lee, Sangeek Hyun, Woojin Jun, Jae-Pil Heo

TL;DR

The paper addresses the blur caused by the pixel-level loss $\mathcal{L}_\text{pix}$ in perceptual SR by disentangling fidelity bias from perceptual variance, a separation that prior fixes fail to achieve. It introduces AESOP, which uses a pretrained Auto-Encoder to measure distance in the space after decoding, yielding $\mathcal{L}_\text{AESOP} = \| \psi_\text{AE}(I^{HR}) - \psi_\text{AE}(I^{SR}) \|_p$, which targets the fidelity-bias term $SE$ while preserving the perceptual variance component $VE$. By replacing $\mathcal{L}_\text{pix}$ with $\mathcal{L}_\text{AESOP}$ in GAN-based SR frameworks, AESOP provides stronger reconstruction guidance without inducing blurring, leading to improved perception-distortion (PD) trade-offs and better perceptual fidelity across multiple backbones and datasets. The approach relies on lightweight AE pretraining with $\mathcal{L}_\text{pix}$ and on freezing the AE during SR training to avoid collapse, which makes it easy to integrate into existing SR pipelines. Overall, AESOP achieves notable gains in both distortion metrics and perceptual quality, while preserving texture realism and reducing artifacts in perceptual SR tasks.

Abstract

This work tackles the fidelity objective in perceptual super-resolution (SR). Specifically, we address the shortcomings of the pixel-level $L_p$ loss ($\mathcal{L}_\text{pix}$) in the GAN-based SR framework. Since $\mathcal{L}_\text{pix}$ is known to have a trade-off relationship against perceptual quality, prior methods often multiply it by a small scale factor or apply low-pass filters. However, this work shows that these workarounds fail to address the fundamental factor that induces blurring. Accordingly, we focus on two points: 1) precisely discriminating the subcomponent of $\mathcal{L}_\text{pix}$ that contributes to blurring, and 2) guiding only with the factor that is free from this trade-off relationship. We show that both can be achieved in a surprisingly simple manner, with an Auto-Encoder (AE) pretrained with $\mathcal{L}_\text{pix}$. We therefore propose the Auto-Encoded Supervision for Optimal Penalization loss ($\mathcal{L}_\text{AESOP}$), a novel loss function that measures distance in the AE space instead of the raw pixel space. Note that the AE space indicates the space after the decoder, not the bottleneck. By simply substituting $\mathcal{L}_\text{pix}$ with $\mathcal{L}_\text{AESOP}$, we can provide effective reconstruction guidance without compromising perceptual quality. Designed for simplicity, our method enables easy integration into existing SR frameworks. Experimental results verify that AESOP can lead to favorable results in the perceptual SR task.
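
To make the substitution described above concrete, the following is a minimal sketch of where each loss is computed. It is not the authors' implementation. The function names (`lp_distance`, `pixel_loss`, `aesop_loss`, `toy_autoencoder`) are hypothetical, and the "auto-encoder" is a hand-written stand-in (average adjacent pairs, then repeat them), not a network pretrained with $\mathcal{L}_\text{pix}$. The toy only illustrates that the distance is taken between decoder outputs; it makes no claim about how a learned AE separates fidelity bias from perceptual variance.

```python
from typing import Callable, List

Image = List[float]  # flattened toy "image"; real inputs would be image tensors


def lp_distance(a: Image, b: Image, p: int = 1) -> float:
    """Mean of |a - b|^p over all elements (L1 loss for p=1, MSE for p=2)."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) / len(a)


def pixel_loss(hr: Image, sr: Image, p: int = 1) -> float:
    """Pixel-level loss: distance measured directly in raw pixel space."""
    return lp_distance(hr, sr, p)


def aesop_loss(hr: Image, sr: Image,
               psi_ae: Callable[[Image], Image], p: int = 1) -> float:
    """Distance between auto-encoded images.

    psi_ae maps an image to its reconstruction (the decoder output),
    not to the bottleneck code. It is assumed to be frozen.
    """
    return lp_distance(psi_ae(hr), psi_ae(sr), p)


def toy_autoencoder(x: Image) -> Image:
    """Hypothetical stand-in for a frozen AE (NOT a pretrained network).

    Encoder: average each adjacent pair. Decoder: repeat each code twice.
    """
    if len(x) % 2 != 0:
        raise ValueError("toy auto-encoder expects an even-length input")
    codes = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return [c for c in codes for _ in range(2)]


hr = [1.0, 3.0, 2.0, 6.0]
sr_textured = [3.0, 1.0, 6.0, 2.0]  # same pair averages as hr, different detail
sr_biased = [2.0, 4.0, 3.0, 7.0]    # hr shifted by +1 everywhere

print(pixel_loss(hr, sr_textured))                   # 3.0
print(aesop_loss(hr, sr_textured, toy_autoencoder))  # 0.0
print(pixel_loss(hr, sr_biased))                     # 1.0
print(aesop_loss(hr, sr_biased, toy_autoencoder))    # 1.0
```

In this toy setting, the pixel loss penalizes `sr_textured` heavily even though it differs from `hr` only in detail that the stand-in auto-encoder discards, while the auto-encoded distance is zero. The uniformly shifted `sr_biased` is penalized by both. In an actual pipeline, `psi_ae` would be the pretrained, frozen AE, and the inputs would be batched image tensors.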

Paper Structure

This paper contains 20 sections, 7 equations, 24 figures, 9 tables.

Figures (24)

  • Figure 1: Conceptual illustration of the proposed AESOP loss and the pixel-level $\mathcal{L}_\text{p}$ reconstruction guidance employed in typical perceptual SR methods. (a) A fidelity-oriented SR network trained with $\mathcal{L}_\text{pix}$ estimates the average over plausible solutions (i.e., the optimal fidelity point). Meanwhile, perceptual SR involves a range of multiple solutions distributed around the optimal fidelity point. Thus, we identify two fundamental components of a perceptual SR image: 1) the perceptual variance factor (red line), which possesses randomness and contributes to realistic textures, and 2) the fidelity bias term (orange dot), the residual blurry component of an SR image, which contributes to the overall fidelity apart from the perceptual variance. (b) Typical perceptual SR methods adopt $\mathcal{L}_\text{pix}$ for reconstruction guidance, which pushes the perceptual variance factor to vanish. Thus, when it is combined with perceptual-quality-oriented losses that encourage this variance factor, a conflict arises, leading to suboptimal performance. (c) In contrast, $\mathcal{L}_\text{AESOP}$ only penalizes the fidelity-bias-induced error while preserving these critical perceptual variance factors. This ensures improved fidelity without sacrificing perceptual quality.
  • Figure 2: (a) We pretrain an Auto-Encoder $\psi_\text{AE}$ that removes perceptual variance factors, thereby establishing a feature space where only the fidelity bias factors reside. (b) The main SR network training step with the proposed $\mathcal{L}_\text{AESOP}$. By applying reconstruction objectives such as the $\mathcal{L}_\text{1}$ loss in the auto-encoded space, we can solely target the fidelity-bias-induced error without suffering from vanishing perceptual variance (i.e., without blurring). We omit perceptual-quality-oriented losses here.
  • Figure 3: Key components of $\mathcal{L}_\text{AESOP}$ and $\mathcal{L}_\text{pix}$ on the SwinIR backbone. $\mathcal{L}_\text{pix}$ in (e) penalizes perceptual variance factors, leading to blurry images in (b). In contrast, $\mathcal{L}_\text{AESOP}$ in (f) only penalizes based on the fidelity bias (d), which enables us to obtain increased realism as in (c).
  • Figure 4: The PD trade-off curve. The backbone and training patch size are indicated; if not specified, the default patch size is 128.
  • Figure 5: Visual comparison of AESOP with baseline methods for the $\times$4 SR task on the RRDB backbone. Our method produces images with fewer visual artifacts. See the Appendix for more visual examples of AESOP's improvement in realism and fine details.
  • ...and 19 more figures