DFU: scale-robust diffusion model for zero-shot super-resolution image generation

Alex Havrilla, Kevin Rojas, Wenjing Liao, Molei Tao

TL;DR

DFU tackles the fixed-resolution limitation of diffusion models by introducing a scale-robust, multi-resolution architecture that blends spatial and spectral processing to learn the score operator across resolutions. The Dual-FNO UNet integrates Dual-Convolution within a UNet backbone and leverages infinite-dimensional diffusion principles to enable zero-shot super-resolution up to roughly $2\times$ the training resolution, with fidelity preserved by architectural design and targeted fine-tuning. Empirical results on FFHQ and LSUN-Church show that DFU with mixed-resolution training outperforms baselines, and that a mixed fine-tuning scheme further improves high-resolution coherence and fidelity, achieving an FID of 11.3 at $1.66\times$ the maximum training resolution on FFHQ. This work demonstrates a practical path to high-quality diffusion-based super-resolution without high-resolution training data, with potential impact on scalable, multi-scale image generation.

Abstract

Diffusion generative models have achieved remarkable success in generating images with a fixed resolution. However, existing models have limited ability to generalize to different resolutions when training data at those resolutions are not available. Leveraging techniques from operator learning, we present a novel deep-learning architecture, Dual-FNO UNet (DFU), which approximates the score operator by combining both spatial and spectral information at multiple resolutions. Comparisons of DFU to baselines demonstrate its scalability: 1) simultaneously training on multiple resolutions improves FID over training at any single fixed resolution; 2) DFU generalizes beyond its training resolutions, allowing for coherent, high-fidelity generation at higher resolutions with the same model, i.e., zero-shot super-resolution image generation; 3) we propose a fine-tuning strategy to further enhance the zero-shot super-resolution image-generation capability of our model, leading to an FID of 11.3 at 1.66 times the maximum training resolution on FFHQ, which no other method can come close to achieving.
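
To make "combining both spatial and spectral information" concrete, here is a minimal 1D sketch. It is not the authors' implementation: the function names (`spatial_conv`, `spectral_conv`, `dual_conv`, `sample`), the choice to add the two branches, and the weight values are all assumptions made for illustration. The spectral branch follows the general Fourier-neural-operator idea of reweighting a fixed number of low-frequency modes, which is why the same weights can be applied on grids of different sizes.

```python
import cmath
import math


def spatial_conv(x, kernel):
    """Periodic local stencil (cross-correlation, as in deep-learning convolutions)."""
    n, half = len(x), len(kernel) // 2
    return [
        sum(kernel[i] * x[(j + i - half) % n] for i in range(len(kernel)))
        for j in range(n)
    ]


def spectral_conv(x, weights):
    """Reweight the lowest len(weights) Fourier modes and drop the rest.

    Assumes a real signal and len(weights) < len(x) / 2. A naive DFT is used
    for clarity; a practical implementation would use an FFT.
    """
    n = len(x)
    # Normalised coefficients: for a band-limited signal they do not depend on n.
    coeffs = [
        sum(x[j] * cmath.exp(-2j * math.pi * m * j / n) for j in range(n)) / n
        for m in range(len(weights))
    ]
    out = []
    for j in range(n):
        value = (weights[0] * coeffs[0]).real
        for m in range(1, len(weights)):
            # Factor 2 accounts for the conjugate (negative-frequency) mode.
            mode = weights[m] * coeffs[m] * cmath.exp(2j * math.pi * m * j / n)
            value += 2 * mode.real
        out.append(value)
    return out


def dual_conv(x, kernel, weights):
    """Hypothetical combination: add the spatial and spectral branches."""
    spatial = spatial_conv(x, kernel)
    spectral = spectral_conv(x, weights)
    return [a + b for a, b in zip(spatial, spectral)]


def sample(n):
    """Sample one fixed periodic function on an n-point grid."""
    return [
        math.sin(2 * math.pi * j / n) + 0.5 * math.cos(4 * math.pi * j / n)
        for j in range(n)
    ]


weights = [1.0, 0.5 + 0.5j, -0.25j, 0.1]  # illustrative values, not trained
coarse = spectral_conv(sample(32), weights)
fine = spectral_conv(sample(64), weights)
# Largest gap between the two outputs at the grid points they share.
print(max(abs(a - b) for a, b in zip(coarse, fine[::2])))
```

In this sketch the spectral branch gives matching outputs on the 32-point and 64-point grids at their shared points, because its weights are attached to frequencies, not to pixels. The fixed-size stencil in `spatial_conv`, by contrast, covers a smaller physical neighbourhood as the grid gets finer.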

Paper Structure (23 sections, 8 equations, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Visual comparison of DFU with baselines on zero-shot super-resolution image generation. Panel 1: Single-resolution UNet trained on data with resolution up to 96x96, sampled at 128x128. Panel 2: Multi-resolution UNet sampled at 128x128. Panel 3: Multi-resolution DFU sampled at 128x128. Note that both UNets struggle to maintain global coherence beyond the maximum training resolution $r=96$.
  • Figure 2: FID versus sampling resolution for the multi-resolution UNet and DFU. DFU achieves lower FID until it loses local coherence at 2x the training resolution.
  • Figure 3: Samples from DFU across resolutions. Left: DFU sampled at $r=32$, $64$, and $128$. Right: DFU sampled at $r=160$. DFU is trained on a mixture of resolutions from $r=32$ to $r=96$.
  • Figure 4: Left: DFU architecture. Right: Dual-FNO block. The block on the right is integrated into the architecture on the left by connecting arrows of matching line styles (solid for passing matrices, short-dashed for skip connections, and long-dashed for the time embedding).
  • Figure 5: Comparison of pre-trained DFU to bilinear upsampling. Left: $r=96$ image bilinearly upsampled to 160x160. Middle: DFU sampled at 160x160. Right: Ground-truth 160x160 image included as a reference for quality. The bilinearly upsampled image maintains good coherence but lacks fidelity. In contrast, DFU maintains both coherence and fidelity comparable to the ground truth.
  • ...and 8 more figures

Theorems & Definitions (1)

  • Remark 1