Computational Tradeoffs in Image Synthesis: Diffusion, Masked-Token, and Next-Token Prediction

Maciej Kilian, Varun Jampani, Luke Zettlemoyer

TL;DR

This study runs a compute-controlled comparison of diffusion, masked-token prediction, and next-token prediction for latent image synthesis on Transformer backbones, trained across a grid of model sizes and dataset sizes. It uses latent autoencoders with continuous and discrete latents, trains diffusion with a flow-matching objective and the token models with autoregressive and masked-token objectives, and evaluates training and inference efficiency via FID and CLIP score as a function of compute. Token-based methods achieve higher CLIP scores (better prompt following), and next-token prediction is the most compute-efficient at inference. On image quality (FID), next-token prediction leads at low compute, but scaling trends suggest diffusion eventually matches it. EMA affects diffusion more than token methods, and autoencoder quality strongly influences FID. The practical guidance: prefer diffusion for applications targeting image quality and low latency, and next-token prediction when prompt following or throughput matters more.

Abstract

Nearly every recent image synthesis approach, including diffusion, masked-token prediction, and next-token prediction, uses a Transformer network architecture. Despite this common backbone, there has been no direct, compute-controlled comparison of how these approaches affect performance and efficiency. We analyze the scalability of each approach through the lens of compute budget measured in FLOPs. We find that token prediction methods, led by next-token prediction, significantly outperform diffusion on prompt following. On image quality, while next-token prediction initially performs better, scaling trends suggest it is eventually matched by diffusion. We compare the inference compute efficiency of each approach and find that next-token prediction is by far the most efficient. Based on our findings, we recommend diffusion for applications targeting image quality and low latency, and next-token prediction when prompt following or throughput is more important.
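
To make "compute budget measured in FLOPs" concrete, here is a minimal sketch using the common rule of thumb that training a Transformer costs roughly 6 FLOPs per parameter per token processed. This is an assumption for illustration, not necessarily the accounting the authors use. The function name and all numeric values are hypothetical.

```python
def training_flops(num_params: int, tokens_per_step: int, steps: int) -> float:
    """Approximate training compute with the 6 * N * D rule of thumb.

    N is the parameter count and D the total number of tokens processed.
    The rule ignores attention's quadratic term and any autoencoder cost.
    """
    tokens_seen = tokens_per_step * steps
    return 6.0 * num_params * tokens_seen


# Hypothetical configuration, chosen only to show the arithmetic.
example_budget = training_flops(num_params=1_000_000, tokens_per_step=1_000, steps=10)
print(f"{example_budget:.2e} FLOPs")
```

Under this approximation, a fixed budget can be spent either on a larger model trained on fewer tokens or on a smaller model trained on more, which is why a compute-controlled comparison sweeps both model size and dataset size.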

Paper Structure (22 sections, 5 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Images generated using our best models. Top row is from a next-token prediction model, bottom row is from a diffusion model. Both models are XL size and trained for 500k steps.
  • Figure 2: Impact of autoencoder quality on diffusion models. We train L-size diffusion models on our set of continuous latent space autoencoders. The choice of autoencoder has more impact on FID than CLIP score. Effectively using a larger latent space requires more compute and model capacity.
  • Figure 3: Training compute efficiency on perceptual metrics. Performance on CLIP and FID scores for various models and dataset sizes across different image synthesis approaches. On FID, next-token prediction is initially the most compute-efficient but scaling trends suggest it is eventually matched by diffusion. Token-based methods significantly outperform diffusion in CLIP score. Both axes are in log scale.
  • Figure 4: Training compute efficiency on final loss. All objectives follow predictable scaling trends. Right plot shows the difference in loss scale between diffusion models trained on top of different autoencoders. FLOPs axis is in log scale.
  • Figure 5: Inference compute efficiency on perceptual metrics. Diffusion and masked-token prediction are evaluated at 4, 10, 20, 50, and 100 sampling steps. Next-token prediction is one forward pass factorized over each token individually. Masked-token prediction is not strongly influenced by the number of sampling steps. Next-token prediction is the most compute-efficient. Both axes are in log scale.
  • ...and 3 more figures
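
The inference comparison described in the Figure 5 caption can be sketched with back-of-the-envelope arithmetic. The sketch assumes a forward pass costs about 2 FLOPs per parameter per token, that diffusion and masked-token samplers reprocess the full latent sequence at every sampling step, and that next-token prediction with a key-value cache processes each token once. It ignores attention cost, classifier-free guidance, the autoencoder, and differences in sequence length between approaches. All names and the values of `PARAMS` and `TOKENS` are hypothetical; only the step counts come from the caption.

```python
def forward_flops(num_params: int, num_tokens: int) -> float:
    """Approximate cost of one forward pass: 2 FLOPs per parameter per token."""
    return 2.0 * num_params * num_tokens


def iterative_sampling_flops(num_params: int, num_tokens: int, steps: int) -> float:
    """Diffusion or masked-token sampling: the whole sequence is processed each step."""
    return steps * forward_flops(num_params, num_tokens)


def next_token_sampling_flops(num_params: int, num_tokens: int) -> float:
    """Next-token sampling with a KV cache: each token is processed once."""
    return forward_flops(num_params, num_tokens)


PARAMS = 1_000_000_000  # hypothetical model size
TOKENS = 256            # hypothetical latent sequence length
STEPS = (4, 10, 20, 50, 100)  # sampling steps listed in the Figure 5 caption

# Cost of iterative sampling relative to next-token sampling at each step count.
ratios = {
    s: iterative_sampling_flops(PARAMS, TOKENS, s)
    / next_token_sampling_flops(PARAMS, TOKENS)
    for s in STEPS
}
for s, r in ratios.items():
    print(f"{s:>3} steps: {r:.0f}x the next-token cost")
```

Under these simplifications the relative cost equals the number of sampling steps. That is one way to read why next-token prediction comes out as the most compute-efficient at inference, even though its per-token decoding is sequential.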