DFU: scale-robust diffusion model for zero-shot super-resolution image generation

Alex Havrilla; Kevin Rojas; Wenjing Liao; Molei Tao

TL;DR

DFU addresses the fixed-resolution limitation of diffusion models with a scale-robust, multi-resolution architecture that combines spatial and spectral processing to learn the score operator across resolutions. The Dual-FNO UNet integrates Dual-Convolution within a UNet backbone and draws on infinite-dimensional diffusion principles to enable zero-shot super-resolution up to roughly $2\times$ the training resolution, with fidelity supported by the architectural design and by targeted fine-tuning. Empirical results on FFHQ and LSUN-Church show that DFU with mixed-resolution training outperforms baselines, and that a mixed fine-tuning scheme further improves high-resolution coherence and fidelity, reaching an FID of 11.3 at $1.66\times$ the maximum training resolution on FFHQ. The work demonstrates a practical path to high-quality diffusion-based image generation at resolutions for which no training data is available, with potential impact on scalable, multi-scale image generation.

Abstract

Diffusion generative models have achieved remarkable success in generating images with a fixed resolution. However, existing models have limited ability to generalize to different resolutions when training data at those resolutions are not available. Leveraging techniques from operator learning, we present a novel deep-learning architecture, Dual-FNO UNet (DFU), which approximates the score operator by combining both spatial and spectral information at multiple resolutions. Comparisons of DFU to baselines demonstrate its scalability: 1) simultaneously training on multiple resolutions improves FID over training at any single fixed resolution; 2) DFU generalizes beyond its training resolutions, allowing for coherent, high-fidelity generation at higher resolutions with the same model, i.e., zero-shot super-resolution image generation; 3) we propose a fine-tuning strategy to further enhance the zero-shot super-resolution image generation capability of our model, leading to an FID of 11.3 at 1.66 times the maximum training resolution on FFHQ, which no other method can come close to achieving.
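
To make the idea of combining spatial and spectral information concrete, the following is a minimal NumPy sketch, not the paper's implementation. It pairs a Fourier-space convolution in the style of an FNO layer (a fixed number of low-frequency modes multiplied by complex weights) with an ordinary 3x3 spatial convolution. The function names `spectral_conv2d`, `spatial_conv2d`, and `dual_block`, the single-channel setup, the weight shapes, the periodic padding, and the `tanh` nonlinearity are all assumptions made for illustration; the precise Dual-Convolution definition is given in the paper's own sections.

```python
import numpy as np

def spectral_conv2d(x, weights):
    """Scale the lowest Fourier modes of x by complex weights (FNO-style).

    x: (H, W) real array. weights: (2*k, k) complex array (hypothetical shape);
    rows [:k] act on non-negative row frequencies, rows [k:] on negative ones.
    Requires H >= 2*k and W//2 + 1 >= k.
    """
    k = weights.shape[1]
    H, W = x.shape
    # norm="forward" divides by H*W, so coefficients of a band-limited
    # function do not depend on the sampling resolution.
    x_hat = np.fft.rfft2(x, norm="forward")
    out_hat = np.zeros_like(x_hat)
    out_hat[:k, :k] = x_hat[:k, :k] * weights[:k]
    out_hat[-k:, :k] = x_hat[-k:, :k] * weights[k:]
    return np.fft.irfft2(out_hat, s=(H, W), norm="forward")

def spatial_conv2d(x, kernel):
    """Cross-correlate x with a small odd-sized kernel, using periodic padding."""
    out = np.zeros_like(x)
    r = kernel.shape[0] // 2
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            # np.roll by (r - i, r - j) aligns x[p + i - r, q + j - r] with (p, q)
            out += kernel[i, j] * np.roll(x, (r - i, r - j), axis=(0, 1))
    return out

def dual_block(x, weights, kernel):
    """Hypothetical dual block: sum of spectral and spatial branches, then tanh."""
    return np.tanh(spectral_conv2d(x, weights) + spatial_conv2d(x, kernel))

def sample(n):
    """Sample one band-limited function on an n x n grid over the unit square."""
    t = np.arange(n) / n
    X, Y = np.meshgrid(t, t, indexing="ij")
    return np.sin(2 * np.pi * X) + np.cos(4 * np.pi * Y)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 4)) + 1j * rng.standard_normal((8, 4))
kern = rng.standard_normal((3, 3))

y32 = spectral_conv2d(sample(32), w)
y64 = spectral_conv2d(sample(64), w)
# The same weights act on both grids; outputs agree at the shared grid points.
print("max spectral mismatch:", np.abs(y64[::2, ::2] - y32).max())
print("dual block output shapes:", dual_block(sample(32), w, kern).shape,
      dual_block(sample(64), w, kern).shape)
```

In this sketch the spectral branch is parameterized by frequency rather than by pixel, so the same weights apply unchanged at any grid size that resolves the retained modes. The spatial kernel, by contrast, always spans three pixels, so its physical footprint on a fixed domain shrinks as the sampling resolution grows.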

Paper Structure (23 sections, 8 equations, 13 figures, 5 tables)

Introduction
Related Work
Designing and training Dual-FNO UNet
Learning the score operator
Designing Dual-FNO UNet
Experiments
Setup
Baselines
DFU generalizes to higher resolutions
Resolution training mixture impacts zero-shot super-resolution image-generation
Mixed-resolution training improves single-resolution generation
Diffusion and Diffusion Generative Modeling in Infinite Dimensions
Dual FNO Precise Definitions
Images as discretizations of functions
FNO Blocks
...and 8 more sections

Figures (13)

Figure 1: Visual comparison of DFU to various baselines at zero-shot super-resolution image generation. Panel 1: Single-resolution UNet trained on data with resolution up to 96x96, sampled at 128x128. Panel 2: Multi-resolution UNet sampled at 128x128. Panel 3: Multi-resolution DFU sampled at 128x128. Note that both UNets struggle to maintain global coherence past the maximum training resolution $r=96$.
Figure 2: Sampled resolution versus FID of multi-res UNet and DFU. DFU has lower FID before losing local coherence at 2x training resolution.
Figure 3: Samples from DFU across resolutions. Left: DFU sampled at $r=$ 32, 64, 128. Right: DFU sampled at $r=$ 160. DFU is trained on a mixture of resolutions from $r=$ 32 to 96.
Figure 4: Left: DFU architecture. Right: Dual-FNO Block. Right is integrated in left by connecting arrows of corresponding line styles (solid for passing matrices, short dashed for skip connection, and long dashed for time embedding).
Figure 5: Comparison of pre-trained DFU to bilinear upsampling. Left: $r=$ 96 image bilinearly upsampled to 160x160. Middle: DFU sampled at 160x160. Right: Ground truth 160x160 image included as a reference for quality. The bilinearly upsampled image maintains good coherence but lacks fidelity. In contrast, DFU maintains both coherence and fidelity comparable to the ground truth.
...and 8 more figures

Theorems & Definitions (1)

Remark 1
