Denoising Diffusion Probabilistic Models for Robust Image Super-Resolution in the Wild

Hshmat Sahak, Daniel Watson, Chitwan Saharia, David Fleet

TL;DR

SR3+ addresses blind single-image super-resolution in the wild, where degradations are unknown and out-of-distribution. It combines a diffusion-based denoiser with two innovations: higher-order degradations during training and noise conditioning augmentation, enabling robust, texture-rich reconstructions. The approach achieves state-of-the-art FID on RealSR and DRealSR zero-shot benchmarks, outperforming Real-ESRGAN and SR3 when trained on comparable data, with further gains from larger models and datasets. It also provides a tunable test-time mechanism (t_eval) to balance fidelity to the input and perceptual realism, paving the way for more robust diffusion-based super-resolution in practical, uncontrolled settings.

Abstract

Diffusion models have shown promising results on single-image super-resolution and other image-to-image translation tasks. Despite this success, they have not outperformed state-of-the-art GAN models on the more challenging blind super-resolution task, where the input images are out of distribution, with unknown degradations. This paper introduces SR3+, a diffusion-based model for blind super-resolution, establishing a new state-of-the-art. To this end, we advocate self-supervised training with a combination of composite, parameterized degradations, and noise-conditioning augmentation during training and testing. With these innovations, a large-scale convolutional architecture, and large-scale datasets, SR3+ greatly outperforms SR3. It outperforms Real-ESRGAN when trained on the same data, with a DRealSR FID score of 36.82 vs. 37.22, which further improves to FID of 32.37 with larger models, and further still with larger training sets.
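
As a rough illustration of noise-conditioning augmentation, the sketch below corrupts the up-sampled low-resolution conditioning image with Gaussian noise at a chosen level: a random level during training and a fixed level (t_eval) at test time. The cosine schedule, the variance-preserving mixing rule, the training upper bound of 0.5, and the names `signal_level` and `noise_condition` are all assumptions made for this example, not details taken from the paper.

```python
import math
import random


def signal_level(t):
    """Hypothetical cosine schedule: fraction of signal variance kept at t in [0, 1]."""
    return math.cos(0.5 * math.pi * t) ** 2


def noise_condition(image, t, rng):
    """Variance-preserving Gaussian corruption of a conditioning image (assumed form).

    image: list of rows of floats, assumed scaled to roughly [-1, 1].
    t:     noise level in [0, 1]; t = 0 leaves the image unchanged.
    """
    gamma = signal_level(t)
    a, b = math.sqrt(gamma), math.sqrt(1.0 - gamma)
    return [[a * px + b * rng.gauss(0.0, 1.0) for px in row] for row in image]


rng = random.Random(0)
upsampled_lr = [[0.5, -0.25], [0.0, 1.0]]  # toy stand-in for an up-sampled LR image

# Training: draw a random noise level (the 0.5 upper bound is an assumed value).
t_train = rng.uniform(0.0, 0.5)
cond_train = noise_condition(upsampled_lr, t_train, rng)

# Test time: a fixed, user-chosen level t_eval.
t_eval = 0.1
cond_eval = noise_condition(upsampled_lr, t_eval, rng)

# With t = 0 no noise is added and the conditioning image passes through as-is.
clean = noise_condition(upsampled_lr, 0.0, rng)
```

In this sketch a larger t_eval hides more of the input's fine detail from the denoiser, which is one way to read the trade-off between fidelity to the input and perceptual realism described in the TL;DR. In a full model the denoiser would presumably also receive the noise level as an input; that part is omitted here.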

Paper Structure (13 sections, 3 equations, 11 figures, 2 tables)

This paper contains 13 sections, 3 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Blind super-resolution test results ($64\!\times\!64 \rightarrow 256\!\times\! 256$) for SR3+, SR3 and Real-ESRGAN.
  • Figure 2: The SR3+ data pipeline applies a sequence of degradations to HR training images (like Real-ESRGAN but without additive noise). To form the conditioning signal for the neural denoiser, we up-sample the LR image and apply noise conditioning augmentation.
  • Figure 3: Sample comparison between Real-ESRGAN and various SR3+ models (ours). We observe that Real-ESRGAN often suffers from oversmoothing and excessive contrast, while SR3+ is capable of generating high-fidelity, realistic textures.
  • Figure 4: Ablation samples ($t_{eval}\!=\! 0.1$), illustrating the importance of higher-order degradations and noise conditioning augmentation.
  • Figure 5: Samples from SR3+ (400M weights, 61M dataset) using different amounts of test-time noise conditioning augmentation, $t_{eval}$.
  • ...and 6 more figures
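
The data pipeline described in the Figure 2 caption applies a sequence of degradations to HR images to produce LR training inputs. The toy sketch below shows the general idea of a higher-order degradation: the same randomly parameterized round (blur, then resize) is applied more than once. The box blur, the 2x average-pool resize, the radius range, and the function names are illustrative assumptions; the actual pipeline uses its own set of operations and parameters, and this sketch omits steps such as compression.

```python
import random


def box_blur(img, radius):
    """Mean filter with edge clamping; radius 0 returns an unblurred copy."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [img[j][i] for j in ys for i in xs]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def downsample2(img):
    """2x average-pool downsampling (height and width assumed even)."""
    return [
        [
            (img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
            for x in range(0, len(img[0]), 2)
        ]
        for y in range(0, len(img), 2)
    ]


def degrade(img, rng, order=2):
    """Apply `order` rounds of randomly parameterized blur followed by resizing."""
    for _ in range(order):
        radius = rng.randint(0, 2)  # assumed parameter range, for illustration only
        img = downsample2(box_blur(img, radius))
    return img


rng = random.Random(0)
hr = [[float((x + y) % 2) for x in range(8)] for y in range(8)]  # 8x8 checkerboard
lr = degrade(hr, rng)  # two rounds of 2x downsampling give a 2x2 image
```

Because each round draws fresh parameters, repeated calls on the same HR image yield different LR versions, so HR/LR training pairs can be generated without collecting real degraded data.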