DFU: scale-robust diffusion model for zero-shot super-resolution image generation
Alex Havrilla, Kevin Rojas, Wenjing Liao, Molei Tao
TL;DR
DFU addresses the fixed-resolution limitation of diffusion models with a multi-resolution architecture that combines spatial and spectral processing to learn the score operator across resolutions. The Dual-FNO UNet integrates Dual-Convolution within a UNet backbone and draws on infinite-dimensional diffusion principles to enable zero-shot super-resolution, i.e., generation at resolutions higher than any seen in training. On FFHQ and LSUN-Church, DFU trained on mixed resolutions outperforms baselines, and a mixed fine-tuning scheme further improves high-resolution coherence and fidelity, reaching an FID of 11.3 at $1.66\times$ the maximum training resolution on FFHQ. The work shows a practical path to diffusion-based image generation at higher resolutions without high-resolution training data.
Abstract
Diffusion generative models have achieved remarkable success in generating images at a fixed resolution. However, existing models have limited ability to generalize to other resolutions when training data at those resolutions are not available. Leveraging techniques from operator learning, we present a novel deep-learning architecture, Dual-FNO UNet (DFU), which approximates the score operator by combining spatial and spectral information at multiple resolutions. Comparisons of DFU to baselines demonstrate its scalability: 1) training simultaneously on multiple resolutions improves FID over training at any single fixed resolution; 2) DFU generalizes beyond its training resolutions, allowing for coherent, high-fidelity generation at higher resolutions with the same model, i.e., zero-shot super-resolution image generation; 3) we propose a fine-tuning strategy that further enhances the zero-shot super-resolution capability of our model, leading to an FID of 11.3 at 1.66 times the maximum training resolution on FFHQ, which no other method can come close to achieving.
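To make the idea of combining spatial and spectral information concrete, the following is a minimal 1D sketch, not the authors' Dual-Convolution layer. It assumes a spectral branch that rescales a fixed number of low Fourier modes (in the spirit of a Fourier neural operator) and a spatial branch that applies a small circular stencil. All names (`spectral_branch`, `spatial_branch`, `dual_layer`, `WEIGHTS`, `KERNEL`, `sample`) and all numerical values are hypothetical and chosen only for illustration.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse DFT with 1/n normalization."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def spectral_branch(x, weights):
    """Scale the lowest len(weights) Fourier modes and drop all others.

    Assumes len(weights) < len(x) / 2. The weights are indexed by frequency,
    not by grid size, so the same weights apply at any resolution.
    """
    n = len(x)
    X = dft(x)
    Y = [0j] * n
    for k, w in enumerate(weights):
        Y[k] = w * X[k]
        if k > 0:
            # Mirror onto the negative frequency so the output stays real.
            Y[n - k] = w.conjugate() * X[n - k]
    return [v.real for v in idft(Y)]


def spatial_branch(x, kernel):
    """Circular cross-correlation with a small stencil (grid-dependent)."""
    n = len(x)
    half = len(kernel) // 2
    return [sum(kernel[j] * x[(t + j - half) % n] for j in range(len(kernel)))
            for t in range(n)]


def dual_layer(x, kernel, weights):
    """Hypothetical dual layer: sum of the spatial and spectral branches."""
    s = spatial_branch(x, kernel)
    f = spectral_branch(x, weights)
    return [a + b for a, b in zip(s, f)]


def sample(n):
    """Sample a band-limited test signal on n equispaced points of [0, 1)."""
    return [math.sin(2 * math.pi * t / n) + 0.5 * math.cos(4 * math.pi * t / n)
            for t in range(n)]


# Illustrative "learned" parameters (made-up values).
WEIGHTS = [1.0 + 0j, 0.5 - 0.25j, 0.2 + 0.1j]
KERNEL = [0.25, 0.5, 0.25]

coarse = spectral_branch(sample(16), WEIGHTS)
fine = spectral_branch(sample(32), WEIGHTS)
out = dual_layer(sample(32), KERNEL, WEIGHTS)
```

In this sketch, the spectral branch gives the same values at the grid points shared by the 16-point and 32-point grids, because its weights act on frequencies instead of pixels. The fixed stencil in the spatial branch, by contrast, covers a different physical extent at each resolution. How DFU itself reconciles the two branches across resolutions is not specified in the abstract and is not modeled here.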
