Anchored Diffusion Language Model
Litu Rout, Constantine Caramanis, Sanjay Shakkottai
TL;DR
This work introduces the Anchored Diffusion Language Model (ADLM), a two-stage approach that uses anchor tokens to guide diffusion-based denoising and narrows the gap with autoregressive models in both likelihood and generated-text quality. An anchor network predicts distributions over important tokens, and the denoiser conditions on these anchors to reconstruct masked tokens more accurately, with training governed by the Anchored Negative Evidence Lower Bound (ANELBO). The framework yields substantial perplexity gains on LM1B and OpenWebText, achieves state-of-the-art zero-shot performance on several benchmarks, and can surpass autoregressive models in MAUVE score, a measure of how human-like generated text is, when using remasking samplers. The anchoring principle also extends to autoregressive models, where Anchored Chain-of-Thought (ACoT) improves math and logic reasoning. Overall, anchoring improves sample complexity and enhances reasoning and generalization across language modeling tasks.
Abstract
Diffusion Language Models (DLMs) promise parallel generation and bidirectional context, yet they underperform autoregressive (AR) models in both likelihood modeling and generated text quality. We identify that this performance gap arises when important tokens (e.g., key words or low-frequency words that anchor a sentence) are masked early in the forward process, limiting contextual information for accurate reconstruction. To address this, we introduce the Anchored Diffusion Language Model (ADLM), a novel two-stage framework that first predicts distributions over important tokens via an anchor network, and then predicts the likelihoods of missing tokens conditioned on the anchored predictions. ADLM significantly improves test perplexity on LM1B and OpenWebText, achieving up to 25.4% gains over prior DLMs, and narrows the gap with strong AR baselines. It also achieves state-of-the-art performance in zero-shot generalization across seven benchmarks and surpasses AR models in MAUVE score, which marks the first time a DLM generates more human-like text than an AR model. Theoretically, we derive an Anchored Negative Evidence Lower Bound (ANELBO) objective and show that anchoring improves sample complexity and likelihood modeling. Beyond diffusion, anchoring boosts performance in AR models and enhances reasoning in math and logic tasks, outperforming existing chain-of-thought approaches.
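
The two-stage data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the vocabulary, the linear masking schedule, the `importance` scores, and the functions `forward_mask`, `anchor_stage`, and `denoise_stage` are all hypothetical stand-ins chosen for illustration. In ADLM both stages are trained networks; here they are replaced by fixed scoring rules so that the flow (mask, predict anchor distributions, then fill masked positions conditioned on them) is easy to follow.

```python
import math
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary (assumption)


def forward_mask(tokens, t, rng):
    """Masked forward process: hide each token independently with
    probability t (a linear schedule, assumed here for illustration)."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def anchor_stage(noisy, importance):
    """Stage 1 stand-in: return one distribution over VOCAB per position.
    Visible tokens keep a one-hot distribution; masked positions get a
    softmax over hypothetical importance scores."""
    prior = softmax([importance[w] for w in VOCAB])
    return [
        prior if tok == MASK else [float(w == tok) for w in VOCAB]
        for tok in noisy
    ]


def denoise_stage(noisy, anchor_dists):
    """Stage 2 stand-in: keep visible tokens and fill each masked position
    with the most probable word under its anchor distribution. A trained
    denoiser would also use the full bidirectional context."""
    filled = []
    for tok, dist in zip(noisy, anchor_dists):
        if tok == MASK:
            best = max(range(len(VOCAB)), key=lambda i: dist[i])
            filled.append(VOCAB[best])
        else:
            filled.append(tok)
    return filled


# Usage: the content word "cat" has been masked; the importance scores
# (hypothetical) favour it, so the anchor distribution guides the fill.
importance = {"the": 0.0, "cat": 2.0, "sat": 0.0, "on": 0.0, "mat": 1.0}
noisy = ["the", MASK, "sat", "on", "the", "mat"]
anchors = anchor_stage(noisy, importance)
print(denoise_stage(noisy, anchors))
# prints ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

In this sketch, `forward_mask` shows why early masking hurts: at a mask probability of `t`, each important token survives only with probability `1 - t`, so the context available for reconstruction shrinks as `t` grows.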
