A Hybrid Wavelet-Fourier Method for Next-Generation Conditional Diffusion Models
Andrew Kiruluta, Andreas Lemos
TL;DR
We address limitations of pixel-space diffusion by introducing Wavelet-Fourier-Diffusion, a hybrid framework that operates in the frequency domain, using wavelet sub-bands and a partial Fourier transform to jointly capture local detail and global structure. The forward diffusion process corrupts images in frequency space, and a two-branch U-Net performs the reverse process, with conditional generation enabled via cross-attention. Experimental results on CIFAR-10, CelebA-HQ, and a conditional ImageNet subset show competitive FID and IS, with improved control over global coherence and texture. This work opens new avenues for multi-scale, frequency-aware diffusion, with potential extensions to larger scales and additional transform domains.
Abstract
We present a novel generative modeling framework, Wavelet-Fourier-Diffusion, which adapts the diffusion paradigm to hybrid frequency representations in order to synthesize high-fidelity images with improved spatial localization. In contrast to conventional diffusion models that rely exclusively on additive noise in pixel space, our approach uses a multi-transform that combines wavelet sub-band decomposition with partial Fourier steps. Images are progressively degraded in this hybrid spectral domain during the forward diffusion process and reconstructed in it during the reverse process. By supplementing traditional Fourier-based analysis with the spatial localization capabilities of wavelets, our model can capture both global structures and fine-grained features more effectively. We further extend the approach to conditional image generation by integrating embeddings or conditional features via cross-attention. Experimental evaluations on CIFAR-10, CelebA-HQ, and a conditional ImageNet subset illustrate that our method achieves competitive or superior performance relative to baseline diffusion models and state-of-the-art GANs, as measured by Fréchet Inception Distance (FID) and Inception Score (IS). We also show how the hybrid frequency-based representation improves control over global coherence and fine texture synthesis, paving the way for new directions in multi-scale generative modeling.
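To make the idea of corrupting an image in a hybrid spectral domain concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It rests on several assumptions that the abstract does not specify: a single-level Haar wavelet, a "partial Fourier step" interpreted as a 2D FFT applied only to the low-frequency (LL) sub-band, and a standard variance-preserving noising rule with one hypothetical parameter `alpha_bar` shared by all sub-bands. All function names are illustrative.

```python
import numpy as np

def haar_dwt2(x):
    """One-level orthonormal 2D Haar transform (height and width must be even)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0   # coarse approximation
    lh = (a - b + c - d) / 2.0   # detail: differences across columns
    hl = (a + b - c - d) / 2.0   # detail: differences across rows
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def to_hybrid(x):
    """Assumed hybrid transform: Haar sub-bands, with an FFT on the LL band only."""
    ll, lh, hl, hh = haar_dwt2(x)
    return np.fft.fft2(ll, norm="ortho"), lh, hl, hh

def from_hybrid(ll_hat, lh, hl, hh):
    """Invert to_hybrid; the LL band is real again after the inverse FFT."""
    ll = np.fft.ifft2(ll_hat, norm="ortho").real
    return haar_idwt2(ll, lh, hl, hh)

def forward_corrupt(x0, alpha_bar, rng):
    """Noise the hybrid coefficients: z_t = sqrt(alpha_bar) z_0 + sqrt(1 - alpha_bar) eps.

    The noise is drawn as a real image and then transformed, so the noisy LL
    spectrum keeps the conjugate symmetry of a real signal.
    """
    coeffs = to_hybrid(x0)
    noise = to_hybrid(rng.standard_normal(x0.shape))
    s, n = np.sqrt(alpha_bar), np.sqrt(1.0 - alpha_bar)
    return tuple(s * c + n * e for c, e in zip(coeffs, noise))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))          # stand-in for a grayscale image
z_t = forward_corrupt(x0, alpha_bar=0.5, rng=rng)
x_t = from_hybrid(*z_t)                   # noisy image back in pixel space
```

Because both transforms in this sketch are orthonormal, using one noise level for every sub-band is equivalent to ordinary pixel-space Gaussian noising. A frequency-aware schedule would differ only once the sub-bands receive different noise levels, which is a design choice left open here.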
