Esoteric Language Models

Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric Xing, John Thickstun, Arash Vahdat

TL;DR

Eso-LMs are a hybrid framework that fuses the autoregressive (AR) and masked diffusion paradigms in a single transformer, enabling smooth interpolation between AR and diffusion perplexities. A unified attention mechanism and sampling strategy permit KV caching within diffusion, yielding substantial speedups while maintaining strong perplexity. The approach reports state-of-the-art perplexities among diffusion models on standard benchmarks and inference up to 65x faster than vanilla MDMs with competitive generation quality, outperforming prior semi-autoregressive methods. This hybrid framework with KV caching offers practical gains for real-time language tasks and introduces a flexible continuum between the two generation paradigms. The work also compares against BD3-LMs, highlighting robustness against mode collapse and a better speed-quality trade-off.

Abstract

Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Among this family of models, Masked Diffusion Models (MDMs) achieve the strongest performance but still underperform AR models in perplexity and lack key inference-time efficiency features--most notably, KV caching. In this work, we introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, enabling smooth interpolation between their perplexities while overcoming their respective limitations. Eso-LMs set a new state of the art on standard language modeling benchmarks. Crucially, we are the **first to introduce KV caching for MDMs** while preserving parallel generation, significantly improving inference efficiency. Combined with an optimized sampling schedule, our method achieves up to **65x** faster inference than standard MDMs and **4x** faster inference than prior semi-autoregressive approaches. We provide the code and model checkpoints on the project page: [http://s-sahoo.github.io/Eso-LMs](http://s-sahoo.github.io/Eso-LMs)

Paper Structure

This paper contains 63 sections, 15 equations, 11 figures, 10 tables, 2 algorithms.

Figures (11)

  • Figure 1: Efficient generation of an example sequence with our flagship model Eso-LM (B). During Diffusion Phase, Eso-LMs denoise one or more, potentially non-neighboring mask tokens (M) per step. During Sequential Phase, Eso-LMs denoise the remaining mask tokens one at a time from left to right. Eso-LM (B) allows for KV caching in both phases using just a single unified KV cache: blue bounding boxes enclose transformer cells that are building their KV cache; a cell becomes blue once its KV cache is built. The sequences below the transformers depict tokens in their natural order.
  • Figure 1: Test perplexities (PPL; $\downarrow$) on LM1B. $^*$Reported in He et al. (2022). $^\P$Denotes that the dataset did not incorporate sentence packing. $^\dagger$Reported in Arriola et al. (2025). For diffusion models, we report the bound on the likelihood. $^\ddag$Reported in Sahoo et al. (2025).
  • Figure 2: Comparison of attention biases for diffusion-phase training. Orange represents 0 (attention) and gray represents $-\infty$ (no attention). The clean sequence is ${\mathbf x}=(A, B, C, D, E, F)$. After random masking, we obtain ${\mathbf z}_t=(A, M, C, M, M, F)$. The integers denote the position indices with $\mathcal{M}({\mathbf z}_t) = \{2, 4, 5\}$ and $\mathcal{C}({\mathbf z}_t)=\{1, 3, 6\}$. The random ordering is $\sigma=(3, 1, 6, 4, 5, 2)\sim \mathcal{P}_6$ with clean tokens before mask tokens. Red highlights differences between Eso-LM (A) and Eso-LM (B).
  • Figure 3: Gen. PPL vs. NFEs for models trained on OWT. Entropy values are in the Eso-LM (B), MDLM, and BD3-LM entropy tables (labels: table:esolmb-entropy, table:mdlm-entropy, table:bd3lm-entropy).
  • Figure 4: Eso-LM (B) establishes SOTA on the Pareto frontier of sampling speed and quality.
  • ...and 6 more figures
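
The index sets in the Figure 2 caption can be reproduced with a short sketch. The code below recomputes $\mathcal{M}({\mathbf z}_t)$ and $\mathcal{C}({\mathbf z}_t)$ for the caption's example and checks that the ordering $\sigma$ places clean tokens before mask tokens. The function names are hypothetical. The `permuted_causal_bias` helper is an assumption added only to illustrate the 0 / $-\infty$ bias convention: it builds a plain causal mask under $\sigma$, not the exact Eso-LM (A) or Eso-LM (B) pattern, which the caption does not specify.

```python
import math

MASK = "M"  # mask token symbol used in the Figure 2 caption


def split_positions(z_t):
    """Return the 1-indexed mask and clean position sets of z_t."""
    mask_pos = {i for i, tok in enumerate(z_t, start=1) if tok == MASK}
    clean_pos = set(range(1, len(z_t) + 1)) - mask_pos
    return mask_pos, clean_pos


def clean_before_mask(sigma, mask_pos):
    """True if no clean position appears after a mask position in sigma."""
    seen_mask = False
    for pos in sigma:
        if pos in mask_pos:
            seen_mask = True
        elif seen_mask:
            return False
    return True


def permuted_causal_bias(sigma):
    """Illustrative assumption, not the paper's mask.

    bias[i-1][j-1] is 0 (attention) if position j comes at or before
    position i in the ordering sigma, and -inf (no attention) otherwise.
    """
    rank = {pos: r for r, pos in enumerate(sigma)}
    n = len(sigma)
    return [
        [0.0 if rank[j] <= rank[i] else -math.inf for j in range(1, n + 1)]
        for i in range(1, n + 1)
    ]


# Example from the caption: x = (A, B, C, D, E, F) after random masking.
z_t = ("A", "M", "C", "M", "M", "F")
sigma = (3, 1, 6, 4, 5, 2)

mask_pos, clean_pos = split_positions(z_t)
bias = permuted_causal_bias(sigma)
```

Here `mask_pos` and `clean_pos` match the sets stated in the caption, and `clean_before_mask(sigma, mask_pos)` confirms the ordering constraint. Position 3 comes first in $\sigma$, so in this illustrative bias it attends only to itself, while position 2 comes last and attends to every position.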