Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization

Kyle Sargent, Kyle Hsu, Justin Johnson, Li Fei-Fei, Jiajun Wu

TL;DR

FlowMo introduces a transformer-based diffusion autoencoder for discrete image tokenization that achieves state-of-the-art reconstruction on ImageNet-1K without using 2D latent codes, convolutions, adversarial losses, or distillation. Its core idea splits training into mode-matching and mode-seeking stages to bias reconstruction toward perceptual modes, complemented by a shifted sampler for inference. The method attains top tokenization metrics at multiple BPPs and supports a second-stage generative model, illustrating practical utility for high-quality image synthesis from discrete tokens. Ablation studies validate the necessity of Stage 1B and the chosen sampling and noise strategies. Overall, FlowMo sets a new standard for end-to-end, transformer-only image tokenization and offers a flexible pathway for high-fidelity downstream generation.

Abstract

Since the advent of popular visual generation frameworks like VQGAN and latent diffusion models, state-of-the-art image generation systems have generally been two-stage systems that first tokenize or compress visual data into a lower-dimensional latent space before learning a generative model. Tokenizer training typically follows a standard recipe in which images are compressed and reconstructed subject to a combination of MSE, perceptual, and adversarial losses. Diffusion autoencoders have been proposed in prior work as a way to learn end-to-end perceptually-oriented image compression, but have not yet shown state-of-the-art performance on the competitive task of ImageNet-1K reconstruction. We propose FlowMo, a transformer-based diffusion autoencoder that achieves a new state-of-the-art for image tokenization at multiple compression rates without using convolutions, adversarial losses, spatially-aligned two-dimensional latent codes, or distilling from other tokenizers. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage. In addition, we conduct extensive analyses and explore the training of generative models atop the FlowMo tokenizer. Our code and models will be available at http://kylesargent.github.io/flowmo .

Paper Structure

This paper contains 26 sections, 15 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Discrete tokenizer comparison. State-of-the-art discrete tokenizers are benchmarked by encoding and reconstructing the ImageNet-1K validation dataset at $256 \times 256$ resolution, with performance measured in reconstruction FID (rFID), which trades off against compression rate as measured in bits per pixel (BPP). Whether trained for reconstruction at a low BPP (FlowMo-Lo) or high BPP (FlowMo-Hi), FlowMo achieves state-of-the-art image tokenization performance compared with the respective baselines. Moreover, FlowMo is a transformer-based diffusion autoencoder that does not use convolutions, adversarial losses, or proxy objectives from auxiliary tokenizers.
  • Figure 2: Example reconstructions. Comparison of original and reconstructed images of faces and text. OpenMagViT-V2 and FlowMo-Lo are both 0.07 bits-per-pixel tokenizers and should be compared against each other. LlamaGen-32 and FlowMo-Hi are both 0.22 bits-per-pixel tokenizers and should be compared against each other. Best viewed zoomed in, in the electronic version. More comparisons are available at https://kylesargent.github.io/flowmo.
  • Figure 3: FlowMo architecture. FlowMo is a diffusion autoencoder that encodes an image $x$ to a latent $\hat{c}$, which is quantized to $c$. The model then decodes a rectified flow velocity $v$ conditioned on $c$, a noise level $t$, and the noised image $x_t$.
  • Figure 4: Stage 1A. The encoder and decoder are trained end-to-end with output losses $\mathcal{L}_\text{perc}, \mathcal{L}_\text{flow}$ and latent losses $\mathcal{L}_\text{commit}, \mathcal{L}_\text{ent}$.
  • Figure 5: Stage 1B. The frozen encoder $e_\theta$ encodes the input image to $c$ to condition the decoder $d_\theta$, which is trained via backpropagation through the entire sampling chain. We also co-train with $\mathcal{L}_{\mathrm{flow}}$, which is the same as in Figure 4.
  • ...and 8 more figures
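
The bits-per-pixel rates quoted in the captions of Figures 1 and 2 come from simple arithmetic: the number of bits in an image's token sequence divided by its pixel count. The sketch below illustrates the calculation. The token counts and codebook sizes are assumptions chosen to land near the quoted 0.07 and 0.22 BPP rates; they are not values stated in this text.

```python
import math


def bits_per_pixel(num_tokens: int, codebook_size: int,
                   height: int = 256, width: int = 256) -> float:
    """Bits needed to store one image's token sequence, divided by its pixel count."""
    bits_per_token = math.log2(codebook_size)
    return num_tokens * bits_per_token / (height * width)


# Hypothetical configurations, assumed for illustration only.
low_rate = bits_per_pixel(num_tokens=256, codebook_size=2**18)    # 256 * 18 / 65536
high_rate = bits_per_pixel(num_tokens=1024, codebook_size=2**14)  # 1024 * 14 / 65536

print(round(low_rate, 2), round(high_rate, 2))
```

Under these assumed settings the two rates round to 0.07 and 0.22, which is why tokenizers are only compared against others in the same BPP regime.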
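
The Figure 3 caption describes the decoder as predicting a rectified flow velocity $v$ from the noised image $x_t$, the noise level $t$, and the code $c$. A minimal sketch of how such a training target is commonly built follows. It assumes the usual rectified flow convention $x_t = (1 - t)\,x + t\,z$ with target velocity $z - x$. That convention, the `toy_decoder` stand-in, and all function names are hypothetical and for illustration only; this is not FlowMo's implementation.

```python
import random


def noised_sample(x, z, t):
    """Linear interpolation between data x (t = 0) and noise z (t = 1)."""
    return [(1.0 - t) * xi + t * zi for xi, zi in zip(x, z)]


def target_velocity(x, z):
    """Velocity of the straight path from x to z; it does not depend on t."""
    return [zi - xi for xi, zi in zip(x, z)]


def flow_loss(decoder, x, c, rng):
    """Mean squared error between predicted and target velocity for one sample."""
    t = rng.random()                         # noise level in [0, 1)
    z = [rng.gauss(0.0, 1.0) for _ in x]     # Gaussian noise, same shape as x
    x_t = noised_sample(x, z, t)
    v_pred = decoder(x_t, t, c)
    v_true = target_velocity(x, z)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x)


def toy_decoder(x_t, t, c):
    """Hypothetical stand-in for the decoder: always predicts zero velocity."""
    return [0.0 for _ in x_t]


rng = random.Random(0)
x = [0.5, -0.25, 1.0]   # a flattened toy "image"
c = [1, 0, 1]           # a toy quantized code (ignored by the toy decoder)
loss = flow_loss(toy_decoder, x, c, rng)
```

Under this convention, stepping from $x_t$ back along the true velocity by $t$ recovers the clean image, since $x_t - t\,(z - x) = x$. A sampler conditioned on $c$ relies on this property when it integrates the predicted velocity from noise toward a reconstruction.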