Anchored Diffusion Language Model

Litu Rout, Constantine Caramanis, Sanjay Shakkottai

TL;DR

This work introduces Anchored Diffusion Language Models (ADLM), a two-stage approach that uses anchor tokens to guide diffusion-based denoising and closes the gap with autoregressive models in both likelihood and generated-text quality. The anchor network predicts distributions over important tokens, and the denoiser uses these anchors to more accurately reconstruct masked tokens, with training governed by the Anchored Negative Evidence Lower Bound (ANELBO). The framework yields substantial perplexity gains on LM1B and OpenWebText, achieves state-of-the-art zero-shot performance on several benchmarks, and can even surpass autoregressive models in MAUVE-based human-likeness scores when using remasking samplers. The anchoring principle also extends to autoregressive models, where Anchored Chain-of-Thought (ACoT) improves math and logic reasoning; collectively, anchoring reduces sample complexity and enhances reasoning and generalization across language modeling tasks.

Abstract

Diffusion Language Models (DLMs) promise parallel generation and bidirectional context, yet they underperform autoregressive (AR) models in both likelihood modeling and generated text quality. We identify that this performance gap arises when important tokens (e.g., key words or low-frequency words that anchor a sentence) are masked early in the forward process, limiting contextual information for accurate reconstruction. To address this, we introduce the Anchored Diffusion Language Model (ADLM), a novel two-stage framework that first predicts distributions over important tokens via an anchor network, and then predicts the likelihoods of missing tokens conditioned on the anchored predictions. ADLM significantly improves test perplexity on LM1B and OpenWebText, achieving up to 25.4% gains over prior DLMs, and narrows the gap with strong AR baselines. It also achieves state-of-the-art performance in zero-shot generalization across seven benchmarks and surpasses AR models in MAUVE score, which marks the first time a DLM generates better human-like text than an AR model. Theoretically, we derive an Anchored Negative Evidence Lower Bound (ANELBO) objective and show that anchoring improves sample complexity and likelihood modeling. Beyond diffusion, anchoring boosts performance in AR models and enhances reasoning in math and logic tasks, outperforming existing chain-of-thought approaches.
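
The abstract's notion of "important tokens" can be made concrete with a small sketch. It assumes, purely for illustration, that importance is approximated by low corpus frequency, which is one of the examples the abstract mentions. The names `select_anchors`, `mask_except`, the `[MASK]` symbol, and the toy frequency table are hypothetical and are not taken from the paper.

```python
MASK = "[MASK]"  # hypothetical mask symbol used only in this sketch

def select_anchors(tokens, corpus_counts, k):
    """Return the positions of the k rarest tokens (ties broken by position).

    Assumption: rarity is a stand-in for importance; the paper's own
    selection criterion is not specified in this excerpt.
    """
    order = sorted(range(len(tokens)), key=lambda i: (corpus_counts[tokens[i]], i))
    return sorted(order[:k])

def mask_except(tokens, keep):
    """Mask every position except those in `keep`."""
    keep = set(keep)
    return [tok if i in keep else MASK for i, tok in enumerate(tokens)]

# Toy frequency table (made up for the example).
counts = {"a": 50, "is": 40, "with": 30, "playing": 5, "cat": 2, "dog": 2}
sentence = "a cat is playing with a dog".split()

anchors = select_anchors(sentence, counts, k=2)
anchored_view = mask_except(sentence, anchors)
print(anchors)        # positions of 'cat' and 'dog'
print(anchored_view)  # only the two anchors remain visible
```

If the two anchors were masked first instead, a denoiser would see only function words such as 'a', 'is', and 'with', which is the low-context situation the abstract identifies as the source of the gap.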

Paper Structure

This paper contains 38 sections, 5 theorems, 69 equations, 5 figures, 12 tables, 1 algorithm.

Key Result

Theorem 4.1

Suppose the inference posterior is parameterized as in (eq:inf-post-adlm). Denote by $\theta$ the collection of parameters of the anchor and denoiser networks, i.e., $\theta = [\psi, \varphi]$. Given a sequence ${\mathbf{x}} = ({\mathbf{x}}^l)_{l=1}^L$, let the important token mixture ${\mathbf{y}}$ with weight $\lambda_{t(i)} = \frac{(1-\sigma_{t(i)})\alpha_{t(i)} - \alpha_{s(i)}}{1-\alpha_{t(i)}}$ …
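
As a rough numerical illustration of the weight $\lambda_{t(i)}$, the sketch below evaluates the visible part of the formula. It assumes that the denominator is exactly $1-\alpha_{t(i)}$ (the statement is truncated in this excerpt), that $s(i)$ is the less noisy time step so that $\alpha_{s(i)} > \alpha_{t(i)}$, and that $\alpha_t = 1 - t$ is a linear schedule chosen only for the example. The function names are hypothetical.

```python
def linear_alpha(t):
    """Assumed linear schedule: fraction of tokens still unmasked at time t."""
    return 1.0 - t

def lambda_weight(alpha_t, alpha_s, sigma_t=0.0):
    """Evaluate ((1 - sigma_t) * alpha_t - alpha_s) / (1 - alpha_t).

    sigma_t is the remasking parameter; sigma_t = 0 leaves
    (alpha_t - alpha_s) / (1 - alpha_t).
    """
    if not 0.0 <= alpha_t < 1.0:
        raise ValueError("alpha_t must lie in [0, 1) for the weight to be finite")
    return ((1.0 - sigma_t) * alpha_t - alpha_s) / (1.0 - alpha_t)

t, s = 0.5, 0.4  # s is the less noisy step under the stated assumption
w_plain = lambda_weight(linear_alpha(t), linear_alpha(s))
w_remask = lambda_weight(linear_alpha(t), linear_alpha(s), sigma_t=0.1)
print(w_plain, w_remask)
```

Under these assumptions the weight is negative, which is what one would expect if it multiplies a log-likelihood inside a negative lower bound, and a positive $\sigma_t$ makes it larger in magnitude.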

Figures (5)

  • Figure 1: Anchored Diffusion Language Model (ADLM). ADLM introduces an anchor network that predicts a mixture over the important tokens of a sequence (e.g., 349 ('cat') and 329 ('dog')). These anchored predictions guide a denoiser network to better estimate the likelihoods of masked tokens (50257). Here, we illustrate the pathways for the tokens 1760 ('playing') and 64 ('a'). Anchoring through important tokens helps ADLM narrow the performance gap with autoregressive models.
  • Figure 2: Training loss and validation PPL versus number of iterations on OWT. We train both MDLM and our ADLM model for 2M iterations (524B tokens). As discussed in the theory section, anchoring improves the sample complexity during training, resulting in faster convergence and lower validation perplexity. While the anchor loss is part of the training objective, we only visualize the NELBO here for a direct comparison with MDLM.
  • Figure 3: Training of standard autoregressive (AR) models. A neural network is trained to predict the next token using causal attention (left-to-right context). All tokens contribute equally to the training loss, and the model treats the sequence uniformly without structural guidance.
  • Figure 4: Training of anchored autoregressive (A2R) models. An anchor network first identifies important tokens (e.g., 'cat', 'dog' shown in blue), which are supervised via an auxiliary anchor loss. A lightweight LLM is then trained to predict the next token based on anchored predictions.
  • Figure 5: Multi-stage training pipeline for Anchored Chain-of-Thought (ACoT). Here, [BOA] and [EOA] denote the beginning and end of anchors, respectively. Many reasoning traces contain redundant information, increasing entropy and making the reasoning process harder to learn. By supervising the model through a small set of important tokens extracted from the reasoning trace, ACoT encourages more structured intermediate computations, guiding the model to reason in a more targeted and interpretable way. To reduce the number of additional tokens produced, we drop reasoning tokens for every [ANT] insertion. For example, we drop $r_1$ in Stage-1 and $r_1, r_2$ in Stage-2 to demonstrate this phenomenon.
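
The contrast between Figures 3 and 4 can be sketched as two loss functions. The excerpt does not give the exact form of the auxiliary anchor loss, so the version below assumes a cross-entropy on the anchor network's predictions at the important positions, added to the usual next-token loss with a hypothetical weight `gamma`. All names are illustrative.

```python
import math

def nll(probs, targets):
    """Mean negative log-likelihood of target ids under per-position distributions."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

def anchored_loss(lm_probs, anchor_probs, targets, important, gamma=1.0):
    """Next-token loss over all positions plus an assumed auxiliary anchor term.

    `important` flags the positions treated as anchors; only those positions
    contribute to the auxiliary term.
    """
    ar_term = nll(lm_probs, targets)  # standard AR loss: every token counts equally
    idx = [i for i, flag in enumerate(important) if flag]
    if not idx:
        return ar_term
    aux_term = nll([anchor_probs[i] for i in idx], [targets[i] for i in idx])
    return ar_term + gamma * aux_term

# Toy two-token vocabulary and two positions (values made up).
lm_probs = [[0.5, 0.5], [0.25, 0.75]]
anchor_probs = [[0.5, 0.5], [0.5, 0.5]]
targets = [0, 1]

plain = anchored_loss(lm_probs, anchor_probs, targets, important=[False, False])
anchored = anchored_loss(lm_probs, anchor_probs, targets, important=[False, True])
print(plain, anchored)
```

With no position flagged as important, the function reduces to the uniform loss of Figure 3; flagging a position adds supervision only there, which mirrors the "auxiliary anchor loss" described in the Figure 4 caption.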

Theorems & Definitions (9)

  • Theorem 4.1: Anchored Negative Evidence Lower Bound
  • Remark 4.2
  • Proposition 4.4: Reduced Sample Complexity via Anchoring
  • Theorem A.1: Anchored Negative Evidence Lower Bound
  • proof
  • Proposition A.3: Reduced Sample Complexity via Anchoring
  • Remark A.4
  • Theorem A.6: Monotonic Improvement of Anchored Likelihood
  • proof