SupResDiffGAN a new approach for the Super-Resolution task

Dawid Kopeć, Wojciech Kozłowski, Maciej Wizerkaniuk, Dawid Krutul, Jan Kocoń, Maciej Zięba

TL;DR

SupResDiffGAN presents a latent-space diffusion-GAN hybrid for single-image super-resolution, addressing the speed-accuracy trade-off of diffusion models by operating in a compressed latent space and leveraging adversarial feedback. The approach encodes image pairs into latent codes via a pretrained VAE, uses a U-Net to denoise over a reduced number of diffusion steps conditioned on the low-resolution latent, and employs a Gaussian-noise-augmented discriminator with EMA-driven step scheduling to stabilize training. Empirical results on multiple SR benchmarks show competitive LPIPS performance and markedly faster inference than traditional diffusion SR models, approaching GAN-based methods in quality. This work demonstrates a viable path toward real-time diffusion-based SR and suggests further exploration of latent diffusion and diffusion-GAN hybrids for practical deployment.

Abstract

In this work, we present SupResDiffGAN, a novel hybrid architecture that combines the strengths of Generative Adversarial Networks (GANs) and diffusion models for super-resolution tasks. By leveraging latent space representations and reducing the number of diffusion steps, SupResDiffGAN achieves significantly faster inference times than other diffusion-based super-resolution models while maintaining competitive perceptual quality. To prevent discriminator overfitting, we propose adaptive noise corruption, ensuring a stable balance between the generator and the discriminator during training. Extensive experiments on benchmark datasets show that our approach outperforms traditional diffusion models such as SR3 and I$^2$SB in efficiency and image quality. This work bridges the performance gap between diffusion- and GAN-based methods, laying the foundation for real-time applications of diffusion models in high-resolution image generation.
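The adaptive noise corruption mentioned in the abstract can be pictured with a small scheduler. The sketch below is only an illustration of the general idea, not the authors' rule: it assumes the noise timestep $s$ applied to the discriminator inputs is raised when an exponential moving average (EMA) of discriminator accuracy exceeds a target and lowered otherwise. The class name `AdaptiveNoiseStep` and the parameters `target_acc`, `ema_decay`, and `step_delta` are hypothetical.

```python
class AdaptiveNoiseStep:
    """Hypothetical scheduler for the discriminator noise timestep s.

    Assumption (not taken from the paper): s grows when the discriminator
    wins too often and shrinks when it struggles.
    """

    def __init__(self, max_step=1000, target_acc=0.6, ema_decay=0.9, step_delta=10):
        self.max_step = max_step
        self.target_acc = target_acc
        self.ema_decay = ema_decay
        self.step_delta = step_delta
        self.ema_acc = target_acc  # start the EMA at the target
        self.s = 0                 # start with no added noise

    def update(self, batch_acc):
        """Fold one batch accuracy into the EMA and return the new timestep s."""
        self.ema_acc = self.ema_decay * self.ema_acc + (1.0 - self.ema_decay) * batch_acc
        if self.ema_acc > self.target_acc:
            # Discriminator is too strong: corrupt its inputs more.
            self.s = min(self.max_step, self.s + self.step_delta)
        elif self.ema_acc < self.target_acc:
            # Discriminator is too weak: corrupt its inputs less.
            self.s = max(0, self.s - self.step_delta)
        return self.s


if __name__ == "__main__":
    scheduler = AdaptiveNoiseStep()
    for acc in [0.9, 0.95, 0.9, 0.4, 0.3]:
        print(scheduler.update(acc))
```

Using an EMA rather than the raw batch accuracy keeps the timestep from oscillating on noisy batches; the actual update rule and constants would have to be taken from the paper itself.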

Paper Structure

This paper contains 8 sections, 14 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 2: The training process of our proposed model. The ground truth $x_0$ and the low-resolution image $x_{low}$ are embedded into latent space. The ground-truth latent $z_0$ is diffused to a random timestep $t$ and passed to the generator together with the low-resolution latent $z_{low}$. The generator output $\hat{z}_0$ and $z_0$ are both diffused to a specific timestep $s$ and decoded to pixel space, where the discriminator assesses which sample is real. The final loss is the mean squared error between $z_0$ and $\hat{z}_0$, augmented with the adversarial loss provided by the discriminator.
  • Figure 3: The sampling process of our model. We first embed the input $x_{low}$ into the latent representation $z_{low}$. We then start the reverse diffusion process from pure Gaussian noise and gradually remove the noise to obtain the final latent sample $\hat{z}_0$, which is decoded to pixel space.
  • Figure 4: Qualitative comparison of visual performance on two example images from ImageNet. Low-quality inputs are on the left; results from bicubic upscaling and seven SR models (SRGAN, ESRGAN, Real-ESRGAN, SR3, ResShift, I$^2$SB, and ours) are on the right.
  • Figure 5: Impact of the number of diffusion steps and the sampling method on SupResDiffGAN performance, evaluated using LPIPS. Results are based on the CelebA-HQ dataset. The model maintained quality with fewer steps, significantly reducing inference time. DDPM outperformed DDIM at low step counts, but their results converged with more steps.
  • Figure : LR input
  • ...and 3 more figures
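
The data flow described in the captions of Figures 2 and 3 can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: latents are flat lists of floats, the VAE encoder/decoder and the discriminator are omitted, the linear beta schedule and the adversarial weight `adv_weight` are assumed, and the reverse process uses a DDIM-style deterministic update as one possible sampler. All function names here are hypothetical. The only detail taken from the captions is that the generator receives the noisy latent and $z_{low}$ and outputs an estimate $\hat{z}_0$.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def diffuse(z0, alpha_bar, rng):
    """Forward diffusion: z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps."""
    return [math.sqrt(alpha_bar) * z + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
            for z in z0]


def generator_loss(z0, z0_hat, adv_loss, adv_weight=0.1):
    """MSE between z0 and z0_hat plus a weighted adversarial term (weight assumed)."""
    mse = sum((a - b) ** 2 for a, b in zip(z0, z0_hat)) / len(z0)
    return mse + adv_weight * adv_loss


def sample(generator, z_low, alpha_bars, num_sampling_steps, rng):
    """DDIM-style reverse process for a generator that predicts z0 directly."""
    last = len(alpha_bars) - 1
    # Timesteps from the noisiest level downwards, roughly evenly spaced.
    steps = sorted({round(last * (1.0 - i / num_sampling_steps))
                    for i in range(num_sampling_steps)}, reverse=True)
    z_t = [rng.gauss(0.0, 1.0) for _ in z_low]  # start from pure Gaussian noise
    for idx, t in enumerate(steps):
        z0_hat = generator(z_t, z_low, t)
        ab_t = alpha_bars[t]
        # After the last step there is no noise left (alpha_bar = 1).
        ab_prev = alpha_bars[steps[idx + 1]] if idx + 1 < len(steps) else 1.0
        # Noise implied by the current z0 estimate.
        eps_hat = [(zt - math.sqrt(ab_t) * zh) / math.sqrt(1.0 - ab_t)
                   for zt, zh in zip(z_t, z0_hat)]
        z_t = [math.sqrt(ab_prev) * zh + math.sqrt(1.0 - ab_prev) * e
               for zh, e in zip(z0_hat, eps_hat)]
    return z_t  # final latent; a VAE decoder would map it to pixel space


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = alpha_bar_schedule(1000)
    z0, z_low = [0.5, -1.0, 2.0], [0.4, -0.9, 1.8]
    z_t = diffuse(z0, alpha_bars[300], rng)        # training-time input to the generator
    stub = lambda z_t, z_low, t: list(z_low)       # stand-in for the U-Net generator
    print(generator_loss(z0, stub(z_t, z_low, 300), adv_loss=0.7))
    print(sample(stub, z_low, alpha_bars, num_sampling_steps=4, rng=rng))
```

The `num_sampling_steps` argument is where a reduced step count would shorten inference: the loop calls the generator once per selected timestep, so fewer steps mean proportionally fewer network evaluations.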