Generative Frontiers: Why Evaluation Matters for Diffusion Language Models

Patrick Pynadath, Jiaxin Shi, Ruqi Zhang

Abstract

Diffusion language models have seen exciting recent progress, offering far more flexibility in generative trajectories than autoregressive models. This flexibility has motivated a growing body of research into new approaches to diffusion language modeling, which typically begins at the scale of GPT-2 small (150 million parameters). However, these advances introduce new issues with evaluation methodology. In this technical note, we discuss the limitations of current methodology and propose principled augmentations to ensure reliable comparisons. We first discuss why OpenWebText has become the standard benchmark, and why alternatives such as LM1B are inherently less meaningful. We then discuss the limitations of likelihood evaluations for diffusion models, and explain why relying on generative perplexity alone as a metric can lead to uninformative results. To address this, we show that generative perplexity and entropy are two components of the KL divergence to a reference distribution. This decomposition explains generative perplexity's sensitivity to entropy, and naturally suggests generative frontiers as a principled method for evaluating model generative quality. We conclude with empirical observations on model quality at this scale. We include a blog post with interactive content to illustrate the argument at https://patrickpynadath1.github.io/blog/eval_methodology/.
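As a brief sketch of the decomposition mentioned above, with notation that is ours rather than the paper's ($q$ for the model's sampling distribution, $p$ for the reference distribution used for scoring):

\[
\mathrm{KL}(q \,\|\, p)
= \mathbb{E}_{x \sim q}\!\left[\log \tfrac{q(x)}{p(x)}\right]
= \underbrace{\mathbb{E}_{x \sim q}\left[-\log p(x)\right]}_{\text{log generative perplexity}}
\;-\;
\underbrace{\mathbb{E}_{x \sim q}\left[-\log q(x)\right]}_{\text{entropy of } q}.
\]

Holding the entropy term fixed, lower generative perplexity implies a smaller KL divergence to the reference; without matching entropy, comparisons based on either term alone can disagree.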

Figures (6)

  • Figure 1: We show the example figure from Pynadath et al. (2025), where different temperature settings result in ranking reversals. This demonstrates that single-point comparisons are not informative on their own, as it is always possible to achieve different rankings via minor temperature adjustments.
  • Figure 2: We demonstrate why single-point evaluations are inherently ambiguous. Even though point B may have a better perplexity, plotting the frontiers reveals that point A is actually closer to the target distribution. Only when points are matched on entropy or perplexity can meaningful conclusions be drawn about distance to the reference distribution.
  • Figure 3: We visualize how single-point metrics can be viewed as individual slices of the same underlying frontiers, evaluated at different points along each curve. Generative frontiers thus provide a unifying framework for interpreting potentially disparate rankings: while single-point metrics measure inference settings, frontiers compare generative model capability. A minimal sketch of how such frontier points can be traced appears after this list.
  • Figure 4: We show the empirical distribution of entropy values across the OpenWebText validation set commonly used in prior work (Pynadath et al., 2025; Sahoo et al., 2025; Sahoo et al., 2024).
  • Figure 5: We illustrate how the median entropy and evaluation perplexity of the OpenWebText validation set can be used to compare dLLMs. We use the AR eval perplexity reported by Sahoo et al. (2024). We observe that different models excel in different regions: when matching the entropy of natural language, Duo excels; when matching the likelihood of natural language, CANDI excels. This demonstrates that generative quality evaluation is multi-faceted.
  • ...and 1 more figure
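
As referenced in the Figure 3 caption, below is a minimal sketch of how one (entropy, generative perplexity) point on a frontier might be computed by sweeping sampling temperature. The functions sample_tokens and reference_logprobs are hypothetical stand-ins, not the authors' code, for a diffusion language model sampler and a reference evaluator such as a pretrained autoregressive model.

# Sketch only (not the authors' code): computing (entropy, generative perplexity)
# points along a frontier by sweeping sampling temperature. `sample_tokens` and
# `reference_logprobs` are hypothetical stand-ins for a diffusion LM sampler and
# a reference evaluator (e.g., a pretrained autoregressive model).
import math
import torch


def sample_tokens(temperature: float, num_samples: int, seq_len: int,
                  vocab_size: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Stand-in sampler: returns token ids and the log-probabilities the
    sampler assigned to them (needed for the entropy estimate)."""
    logits = torch.randn(num_samples, seq_len, vocab_size) / temperature
    probs = torch.softmax(logits, dim=-1)
    ids = torch.multinomial(probs.view(-1, vocab_size), 1).view(num_samples, seq_len)
    log_q = torch.log(probs.gather(-1, ids.unsqueeze(-1)).squeeze(-1))
    return ids, log_q


def reference_logprobs(ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Stand-in reference evaluator: per-token log-probabilities of the
    sampled ids under the reference model."""
    logits = torch.randn(ids.shape[0], ids.shape[1], vocab_size)
    log_p = torch.log_softmax(logits, dim=-1)
    return log_p.gather(-1, ids.unsqueeze(-1)).squeeze(-1)


def frontier_point(temperature: float, num_samples: int = 64, seq_len: int = 128,
                   vocab_size: int = 50_257) -> tuple[float, float]:
    """One (entropy, generative perplexity) pair for a single inference setting."""
    ids, log_q = sample_tokens(temperature, num_samples, seq_len, vocab_size)
    log_p = reference_logprobs(ids, vocab_size)
    entropy = (-log_q).mean().item()            # Monte Carlo estimate of sample entropy
    gen_ppl = math.exp((-log_p).mean().item())  # exp of cross-entropy under the reference
    return entropy, gen_ppl


if __name__ == "__main__":
    # Sweeping temperature traces out the frontier; each point is one inference setting.
    for t in (0.7, 0.85, 1.0):
        h, ppl = frontier_point(t)
        print(f"temperature={t:.2f}  entropy={h:.3f}  gen_ppl={ppl:.1f}")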