Spectral Dictionary Learning for Generative Image Modeling

Andrew Kiruluta

TL;DR

This work addresses the limitations of stochastic latent models by introducing a spectral dictionary learning approach for image synthesis. Images are represented as $\hat{\mathbf{x}}(t) = \sum_{i=1}^K w_i\, s_i(t)$, where each spectral atom $s_i(t)$ is parameterized by time‑varying amplitude, frequency, and phase, and modulated by a small network to capture local spectral dynamics. The dictionary is learned jointly with per‑image mixing coefficients $\mathbf{w}$, and a simple probabilistic prior over $\mathbf{w}$ enables generation via a single deterministic linear synthesis step; an STFT‑based loss enforces both global structure and detailed spectral content. The model yields interpretable spectral components, stable training, and efficient sampling, achieving competitive CIFAR‑10 metrics (e.g., $\text{FID}=55.4$, $\text{IS}=7.2$) while offering a controllable alternative to GANs and diffusion models. This approach opens avenues for interpretable, spectrally‑driven image manipulation and analysis, with potential extensions to higher resolutions and richer priors.

Abstract

We propose a novel spectral generative model for image synthesis that departs radically from the common variational, adversarial, and diffusion paradigms. In our approach, images, after being flattened into one-dimensional signals, are reconstructed as linear combinations of a set of learned spectral basis functions, where each basis is explicitly parameterized in terms of frequency, phase, and amplitude. The model jointly learns a global spectral dictionary with time-varying modulations and per-image mixing coefficients that quantify the contributions of each spectral component. Subsequently, a simple probabilistic model is fitted to these mixing coefficients, enabling the deterministic generation of new images by sampling from the latent space. This framework leverages deterministic dictionary learning, offering a highly interpretable and physically meaningful representation compared to methods relying on stochastic inference or adversarial training. Moreover, the incorporation of frequency-domain loss functions, computed via the short-time Fourier transform (STFT), ensures that the synthesized images capture both global structure and fine-grained spectral details, such as texture and edge information. Experimental evaluations on the CIFAR-10 benchmark demonstrate that our approach not only achieves competitive performance in terms of reconstruction quality and perceptual fidelity but also offers improved training stability and computational efficiency. This new type of generative model opens up promising avenues for controlled synthesis, as the learned spectral dictionary affords a direct handle on the intrinsic frequency content of the images, thus providing enhanced interpretability and potential for novel applications in image manipulation and analysis.
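
The generation step described in the abstract (fit a simple probabilistic model to the mixing coefficients, then sample and synthesize) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes the prior is a multivariate Gaussian, that the trained dictionary is available as a `(K, T)` array `atoms`, and that the per-image coefficients are stacked in an `(N, K)` array `W`. The function names, the `jitter` term, and the `seed` argument are hypothetical.

```python
import numpy as np

def fit_gaussian_prior(W, jitter=1e-6):
    """Fit a multivariate Gaussian to mixing vectors W of shape (N, K).

    `jitter` (an assumption) is added to the diagonal so the covariance
    stays positive definite when N is small relative to K.
    """
    mu = W.mean(axis=0)
    cov = np.cov(W, rowvar=False) + jitter * np.eye(W.shape[1])
    return mu, cov

def sample_images(mu, cov, atoms, n, seed=0):
    """Draw n mixing vectors w* ~ N(mu, cov) and synthesize x* = w* @ atoms.

    atoms has shape (K, T); the result has shape (n, T). Sampling w* is the
    only random step; the synthesis itself is a single linear map.
    """
    rng = np.random.default_rng(seed)
    w_star = rng.multivariate_normal(mu, cov, size=n)
    return w_star @ atoms
```

For CIFAR-10, `T` would be 3072 (a flattened 32×32×3 image), and each returned row would be reshaped back to an image for viewing.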

Paper Structure

This paper contains 8 sections, 9 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: End‑to‑end architecture of the Spectral Dictionary Learning generative model. Mixing Coefficient Estimation: An input image $\mathbf{x}\in\mathbb{R}^{3072}$ is fed into an encoder or sparse coding module that produces a per‑image mixing vector $\mathbf{w}\in\mathbb{R}^K$. Global Spectral Dictionary: A set of $K$ spectral basis functions $s_i(t)$ is constructed from learned base parameters $(A_i^0,f_i^0,\phi_i^0)$ and a modulation network generating $\Delta A_i(t),\Delta f_i(t),\Delta\phi_i(t)$, so that $s_i(t)=\mathrm{softplus}(A_i^0+\Delta A_i(t))\,\sin\bigl(2\pi\,\mathrm{softplus}(f_i^0+\Delta f_i(t))\,t + (\phi_i^0+\Delta\phi_i(t))\bigr)$. Reconstruction Synthesis & Loss: The reconstructed signal is $\hat{\mathbf{x}}(t)=\sum_{i=1}^K w_i\,s_i(t)$, and the training objective combines time‑domain MSE, $\|\mathbf{x}-\hat{\mathbf{x}}\|_2^2$, with frequency‑domain STFT loss, $\|\lvert\mathrm{STFT}(\mathbf{x})\rvert - \lvert\mathrm{STFT}(\hat{\mathbf{x}})\rvert\|_1$. Latent Prior & Generation: After training, a simple prior $p(\mathbf{w})$ (e.g., a multivariate Gaussian) is fitted to the mixing vectors. New images are generated by sampling $\mathbf{w}^*\sim p(\mathbf{w})$ and synthesizing $\hat{\mathbf{x}}^*(t)=\sum_{i=1}^K w_i^*\,s_i(t)$. This pipeline yields a fully deterministic, interpretable, and efficient generative process.
  • Figure 2: Heatmap of the mixing coefficients for a sample CIFAR-10 image. Each column corresponds to a spectral component, and higher values indicate greater contribution to the image reconstruction.
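
The atom definition and training objective in the Figure 1 caption can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: the modulation terms $\Delta A_i(t)$, $\Delta f_i(t)$, $\Delta\phi_i(t)$ default to zero in place of the modulation network, and the STFT frame length (256), hop size (128), Hann window, and loss weight `lam` are hypothetical choices that the caption does not specify.

```python
import numpy as np

def softplus(z):
    # Numerically stable softplus: log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def spectral_atoms(t, A0, f0, phi0, dA=0.0, df=0.0, dphi=0.0):
    """Build the (K, T) dictionary of atoms s_i(t) from the caption's formula.

    A0, f0, phi0 have shape (K,). dA, df, dphi are (K, T) modulations;
    the default 0.0 stands in for the modulation network (an assumption).
    """
    A = softplus(A0[:, None] + dA)      # positive amplitude
    f = softplus(f0[:, None] + df)      # positive frequency
    phase = phi0[:, None] + dphi
    return A * np.sin(2.0 * np.pi * f * t[None, :] + phase)

def synthesize(w, atoms):
    # Linear synthesis: x_hat(t) = sum_i w_i * s_i(t).
    return w @ atoms

def stft_mag(x, frame=256, hop=128):
    # Magnitude STFT with a Hann window; frame and hop are assumed values.
    window = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop : i * hop + frame] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def loss(x, x_hat, lam=1.0):
    # Time-domain squared L2 error plus L1 distance between STFT magnitudes.
    mse = np.sum((x - x_hat) ** 2)
    spec = np.sum(np.abs(stft_mag(x) - stft_mag(x_hat)))
    return mse + lam * spec
```

As a usage example, with `t = np.arange(3072) / 3072` and `K` base parameters per array, `spectral_atoms` returns a `(K, 3072)` dictionary, and `synthesize(w, atoms)` returns one flattened reconstruction that `loss` compares against the target signal.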