Purrception: Variational Flow Matching for Vector-Quantized Image Generation

Răzvan-Andrei Matişan, Vincent Tao Hu, Grigory Bartosh, Björn Ommer, Cees G. M. Snoek, Max Welling, Jan-Willem van de Meent, Mohammad Mahdi Derakhshani, Floor Eijkelboom

TL;DR

Purrception tackles high-resolution image generation with vector-quantized latents by bridging continuous transport and discrete supervision. It introduces a variational flow matching framework that learns a categorical posterior $q_\theta(c|z_t)$ over codebook indices while transporting embeddings with a continuous velocity field $v^{\theta}_t$, enabling uncertainty quantification and temperature-controlled generation. The method optimizes a cross-entropy-based VQ-VFM objective $\mathcal{L}_{\mathrm{Purr}} = -\mathbb{E}_{t,x,z_t}[\log q_\theta(c|z_t)]$, computes the velocity as $v_t^{\theta}(z_t) = (\mu_t(z_t)-z_t)/(1-t)$, and uses a softmax temperature $\tau$ to trade fidelity against diversity; an additional z-loss term stabilizes training. Empirically, on ImageNet-1k ($256\times256$), Purrception converges faster than both continuous FM and discrete FM baselines and achieves competitive FID scores (e.g., $\mathrm{FID}=4.72$) using a pretrained VQ encoder, demonstrating the practical viability of hybrid discrete–continuous modeling for efficient, scalable image generation.
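To make the objective and velocity formula above concrete, here is a minimal single-patch sketch in plain Python. It is not the authors' implementation. Two assumptions are made because the summary does not spell them out: $\mu_t(z_t)$ is taken to be the posterior-mean embedding, i.e. the codebook vectors averaged under $q_\theta(c|z_t)$, and the temperature $\tau$ is applied by dividing the logits before the softmax. The function names and the toy codebook are hypothetical.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over codebook logits (assumed form: logits / tau)."""
    scaled = [l / tau for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Per-patch loss term -log q(c | z_t), computed via a stable log-sum-exp."""
    m = max(logits)
    log_partition = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_partition - logits[target_index]


def velocity(z_t, t, logits, codebook, tau=1.0):
    """v_t(z_t) = (mu_t(z_t) - z_t) / (1 - t).

    Assumption: mu_t(z_t) is the expectation of the codebook embeddings
    under the categorical posterior q(c | z_t).
    """
    q = softmax(logits, tau)
    dim = len(z_t)
    mu = [sum(q[k] * codebook[k][d] for k in range(len(codebook))) for d in range(dim)]
    return [(mu[d] - z_t[d]) / (1.0 - t) for d in range(dim)]


# Toy example: a hypothetical codebook with K = 3 two-dimensional embeddings.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
toy_logits = [0.0, 0.0, 0.0]  # uniform posterior over the three codes
loss = cross_entropy(toy_logits, target_index=1)
v = velocity([0.0, 0.0], t=0.5, logits=toy_logits, codebook=toy_codebook)
```

In this reading, the network is only ever asked to classify the target code for each patch; the continuous velocity is derived from the predicted distribution rather than regressed directly.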

Abstract

We introduce Purrception, a variational flow matching approach for vector-quantized image generation that provides explicit categorical supervision while maintaining continuous transport dynamics. Our method adapts Variational Flow Matching to vector-quantized latents by learning categorical posteriors over codebook indices while computing velocity fields in the continuous embedding space. This combines the geometric awareness of continuous methods with the discrete supervision of categorical approaches, enabling uncertainty quantification over plausible codes and temperature-controlled generation. We evaluate Purrception on ImageNet-1k 256x256 generation. Training converges faster than both continuous flow matching and discrete flow matching baselines while achieving FID scores competitive with state-of-the-art models. This demonstrates that Variational Flow Matching can effectively bridge continuous transport and discrete supervision for improved training efficiency in image generation.

Paper Structure

This paper contains 27 sections, 15 equations, 7 figures, 1 table, 2 algorithms.

Figures (7)

  • Figure 1: Purrception generates high-resolution images in vector-quantized latent spaces, sampled as continuous transport learned through discrete supervision.
  • Figure 2: Purrception approach. Purrception generates high-resolution images in a vector-quantized latent space. For training, we use a pretrained encoder $\mathcal{E}$ and a codebook of $K$ vectors to encode and quantize an image in latent space, obtaining $z_1$. Then, we train a diffusion transformer that, given a linear interpolant $z_t$, predicts a categorical distribution over the codebook vectors for each patch of the target $z_1$ via a cross-entropy objective. For sampling, we generate a quantized latent, which we then pass through the decoder $\mathcal{G}$ to obtain the image in pixel space.
  • Figure 3: Training loss curves with and without z-loss. An additional z-loss avoids training divergence. Raw data is shown in lighter colors, while exponentially smoothed curves (EMA) are shown in bold. We used the same hyperparameters for both runs and a DiT-XL/2 backbone. EMA smoothing factor is $\alpha = 0.9$.
  • Figure 4: Convergence speed comparison on ImageNet-1k. FID-10k scores are plotted against training iterations for Purrception, CFM, and DFM. Results are shown for two DiT backbones: (a) DiT-L/2 and (b) DiT-XL/2. For Purrception, we used the softmax temperature $\tau = 0.9$ during inference for all checkpoints. The plots show that Purrception achieves lower final FID scores and converges significantly faster, matching the final performance of CFM and DFM in fewer training iterations. Full training details are provided in the appendix on implementation details.
  • Figure 5: Generated samples at different softmax temperatures. We can control the output of Purrception by changing the softmax temperature. A low temperature creates simpler, cleaner samples, while a high temperature adds more fine-grained details but can sometimes introduce flaws and reduce the image quality. Here we vary $\tau$ from 0.1 to 1.5.
  • ...and 2 more figures
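
The temperature behaviour described in the Figure 5 caption can be illustrated with a small sketch. It assumes, as is conventional but not confirmed by this summary, that $\tau$ divides the logits before the softmax over codebook indices. The logits below are made up for illustration, and the helper names are hypothetical. Lower $\tau$ concentrates probability on the most likely code, while higher $\tau$ spreads it across more codes, which is consistent with cleaner samples at low temperature and more varied detail at high temperature.

```python
import math


def softmax_with_temperature(logits, tau):
    """Softmax over codebook logits, assuming tau divides the logits."""
    scaled = [l / tau for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a more spread-out posterior."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Made-up logits for a single patch over a 4-entry toy codebook.
example_logits = [2.0, 1.0, 0.5, -1.0]

# Sweep a few temperatures in the 0.1 to 1.5 range mentioned in the caption.
for tau in (0.1, 0.5, 0.9, 1.5):
    q = softmax_with_temperature(example_logits, tau)
    print(f"tau={tau}: max prob={max(q):.3f}, entropy={entropy(q):.3f}")
```

The printed entropy increases with $\tau$, which gives a simple way to check how strongly a chosen temperature sharpens or flattens the predicted distribution.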