A Hybrid Wavelet-Fourier Method for Next-Generation Conditional Diffusion Models

Andrew Kiruluta, Andreas Lemos

TL;DR

We address limitations of pixel-space diffusion by introducing Wavelet-Fourier-Diffusion, a hybrid framework that operates in a frequency domain using wavelet sub-bands and a partial Fourier transform to jointly capture local detail and global structure. The method employs a forward diffusion with frequency-space corruption and a two-branch U-Net for the reverse process, enabling conditional generation via cross-attention. Experimental results on CIFAR-10, CelebA-HQ, and a conditional ImageNet subset show competitive FID and IS, with improved control over global coherence and texture. This work opens new avenues for multi-scale, frequency-aware diffusion with potential extensions to larger scales and additional transform domains.

Abstract

We present a novel generative modeling framework, Wavelet-Fourier-Diffusion, which adapts the diffusion paradigm to hybrid frequency representations in order to synthesize high-fidelity images with improved spatial localization. In contrast to conventional diffusion models that rely exclusively on additive noise in pixel space, our approach uses a multi-transform that combines wavelet sub-band decomposition with partial Fourier steps. Images are progressively degraded in this hybrid spectral domain during the forward diffusion process and reconstructed in it during the reverse process. By supplementing traditional Fourier-based analysis with the spatial localization capabilities of wavelets, our model can capture both global structures and fine-grained features more effectively. We further extend the approach to conditional image generation by integrating embeddings or conditional features via cross-attention. Experimental evaluations on CIFAR-10, CelebA-HQ, and a conditional ImageNet subset illustrate that our method achieves competitive or superior performance relative to baseline diffusion models and state-of-the-art GANs, as measured by Fréchet Inception Distance (FID) and Inception Score (IS). We also show how the hybrid frequency-based representation improves control over global coherence and fine texture synthesis, paving the way for new directions in multi-scale generative modeling.

Paper Structure

This paper contains 19 sections, 6 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: A schematic illustration of the proposed Wavelet-Fourier-Diffusion architecture, showing separate forward (left-to-right) and reverse (right-to-left) processes. In the $\textbf{forward pass}$, an input image $x_0$ undergoes a wavelet transform, producing a low-frequency sub-band $x_0^{\textrm{LF}}$ and multiple high-frequency sub-bands $\{x_0^{\textrm{HF},k}\}$. The low-frequency sub-band is then partially converted to the Fourier domain, yielding $X_0$. A corruption operator $\mathcal{M}_t$ progressively degrades both $X_0$ and the wavelet high-frequency coefficients across $T$ steps, ultimately producing $(X_T,\{x_T^{\textrm{HF},k}\})$, a heavily distorted representation. In the $\textbf{reverse pass}$, a conditional U-Net $\Phi_{\theta}(\cdot,t,c)$ receives the corrupted frequency components at each diffusion step and predicts the restored wavelet-Fourier representation $(\widehat{X}_0,\{\widehat{x}_0^{\textrm{HF},k}\})$. An inverse Fourier transform reconstructs the low-frequency band $\widehat{x}_0^{\textrm{LF}}$, which is then merged with the recovered high-frequency sub-bands through an inverse wavelet transform to yield the final synthesized image $\hat{x}_0$. By blending localized sub-band decomposition with partial global frequency analysis, this approach enhances control over both coarse structures and high-frequency details, while enabling flexible conditional generation through cross-attention in the U-Net.
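
The transform chain described in the Figure 1 caption can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it assumes a single-level Haar wavelet, a full 2D FFT of the low-frequency band standing in for the "partial" Fourier step, and a simple Gaussian mixing rule standing in for the corruption operator $\mathcal{M}_t$, whose exact form is not given in this excerpt. The function names (`haar_dwt2`, `to_hybrid`, `corrupt`, and so on) are hypothetical, and the learned U-Net $\Phi_{\theta}$ is omitted entirely.

```python
import numpy as np

def haar_dwt2(x):
    """One-level orthonormal 2D Haar transform (assumed wavelet choice).
    Returns the low-frequency band and a tuple of three detail bands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    low = (a + b + c + d) / 2.0
    details = ((a - b + c - d) / 2.0,
               (a + b - c - d) / 2.0,
               (a - b - c + d) / 2.0)
    return low, details

def haar_idwt2(low, details):
    """Exact inverse of haar_dwt2."""
    d1, d2, d3 = details
    h, w = low.shape
    x = np.empty((2 * h, 2 * w), dtype=low.dtype)
    x[0::2, 0::2] = (low + d1 + d2 + d3) / 2.0
    x[0::2, 1::2] = (low - d1 + d2 - d3) / 2.0
    x[1::2, 0::2] = (low + d1 - d2 - d3) / 2.0
    x[1::2, 1::2] = (low - d1 - d2 + d3) / 2.0
    return x

def to_hybrid(x0):
    """x0 -> (X0, high-frequency bands): wavelet split, then FFT of the low band."""
    low, details = haar_dwt2(x0)
    return np.fft.fft2(low, norm="ortho"), details

def from_hybrid(X, details):
    """Inverse FFT of the low band, then inverse wavelet transform."""
    low = np.fft.ifft2(X, norm="ortho").real  # discard any imaginary residue
    return haar_idwt2(low, details)

def corrupt(X0, details, t, T, rng):
    """Hypothetical stand-in for M_t: mix each component with Gaussian noise.
    The signal weight falls from 1 at t=0 to 0 at t=T."""
    keep, noise = np.sqrt(1.0 - t / T), np.sqrt(t / T)
    eps = (rng.standard_normal(X0.shape)
           + 1j * rng.standard_normal(X0.shape)) / np.sqrt(2.0)
    Xt = keep * X0 + noise * eps
    dt = tuple(keep * d + noise * rng.standard_normal(d.shape) for d in details)
    return Xt, dt
```

With an even-sized grayscale array `x0`, calling `to_hybrid(x0)` followed by `from_hybrid(...)` recovers the input up to floating-point error, and `corrupt(X0, details, t, T, rng)` with `rng = np.random.default_rng()` produces the degraded pair at step `t`. Both transforms here are orthonormal, so a single noise scale has a comparable effect on every band; whether the paper uses per-band schedules is not stated in this excerpt.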