Esoteric Language Models

Subham Sekhar Sahoo; Zhihan Yang; Yash Akhauri; Johnna Liu; Deepansha Singh; Zhoujun Cheng; Zhengzhong Liu; Eric Xing; John Thickstun; Arash Vahdat

TL;DR

Eso-LMs are a hybrid framework that fuses the autoregressive (AR) and masked diffusion paradigms in a single transformer, enabling smooth interpolation between AR and diffusion perplexities. A unified attention mechanism and sampling strategy permit KV caching within diffusion, yielding substantial speedups while maintaining strong perplexity. The approach reports state-of-the-art perplexities among diffusion models on standard benchmarks and up to 65x faster inference than vanilla masked diffusion models (MDMs) with competitive generation quality, outperforming prior semi-autoregressive methods. The hybrid, KV-cache-enabled framework offers practical gains for real-time language tasks and a flexible continuum between the two generation paradigms. The work also compares against BD3-LMs, highlighting robustness to mode collapse and better speed-quality trade-offs.

Abstract

Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Among this family of models, Masked Diffusion Models (MDMs) achieve the strongest performance but still underperform AR models in perplexity and lack key inference-time efficiency features, most notably KV caching. In this work, we introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, enabling smooth interpolation between their perplexities while overcoming their respective limitations. Eso-LMs set a new state of the art on standard language modeling benchmarks. Crucially, we are the **first to introduce KV caching for MDMs** while preserving parallel generation, significantly improving inference efficiency. Combined with an optimized sampling schedule, our method achieves up to **65x** faster inference than standard MDMs and **4x** faster inference than prior semi-autoregressive approaches. We provide the code and model checkpoints on the project page: [http://s-sahoo.github.io/Eso-LMs](http://s-sahoo.github.io/Eso-LMs)
