Non-Markovian Discrete Diffusion with Causal Language Models

Yangtian Zhang, Sizhuang He, Daniel Levine, Lawrence Zhao, David Zhang, Syed A Rizvi, Shiyang Zhang, Emanuele Zappala, Rex Ying, David van Dijk

TL;DR

CaDDi (Causal Discrete Diffusion Model) is a discrete diffusion model that conditions on the entire generative trajectory, lifting the Markov constraint and allowing the model to revisit and improve past states.

Abstract

Discrete diffusion models offer a flexible, controllable approach to structured sequence generation, yet they still lag behind causal language models in expressive power. A key limitation lies in their reliance on the Markovian assumption, which restricts each step to condition only on the current state, leading to potential uncorrectable error accumulation. In this paper, we introduce CaDDi (Causal Discrete Diffusion Model), a discrete diffusion model that conditions on the entire generative trajectory, thereby lifting the Markov constraint and allowing the model to revisit and improve past states. By unifying sequential (causal) and temporal (diffusion) reasoning in a single non-Markovian transformer, CaDDi also treats standard causal language models as a special case and permits the direct reuse of pretrained LLM weights with no architectural changes. Empirically, CaDDi outperforms state-of-the-art discrete diffusion baselines on natural-language benchmarks, substantially narrowing the remaining gap to large autoregressive transformers.

Paper Structure

This paper contains 78 sections, 5 theorems, 57 equations, 12 figures, 5 tables, 1 algorithm.

Key Result

Proposition 3.1

An absorbing-state non-Markovian discrete diffusion process with marginal transition kernel $\bar{\mathbf{Q}}_t = \left(1-\alpha_t\right) \mathbf{I}+\alpha_t \mathbf{1} e_m^{\top}$ admits a bijection to an absorbing-state Markovian discrete diffusion process with marginal transition kernel …
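
As a minimal sketch of the kernel named in Proposition 3.1, the NumPy snippet below builds $\bar{\mathbf{Q}}_t = (1-\alpha_t)\mathbf{I} + \alpha_t \mathbf{1} e_m^{\top}$ for a toy vocabulary. The function name `absorbing_kernel`, the vocabulary size, the value of $\alpha_t$, and the row-stochastic convention (rows index the clean token, columns the noised token) are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def absorbing_kernel(alpha_t: float, vocab_size: int, mask_id: int) -> np.ndarray:
    """Return (1 - alpha_t) * I + alpha_t * 1 e_m^T as a row-stochastic matrix."""
    identity = np.eye(vocab_size)
    # e_m is the one-hot vector of the absorbing (mask) token.
    e_m = np.zeros(vocab_size)
    e_m[mask_id] = 1.0
    # np.outer(ones, e_m) places a 1 in the mask column of every row.
    return (1.0 - alpha_t) * identity + alpha_t * np.outer(np.ones(vocab_size), e_m)

# Hypothetical toy setup: 4 ordinary tokens plus one mask token at index 4.
Q = absorbing_kernel(alpha_t=0.3, vocab_size=5, mask_id=4)
```

Under these assumptions, each ordinary token is kept with probability $1-\alpha_t$ and replaced by the mask token with probability $\alpha_t$, while the mask row maps to itself with probability 1, which is what makes the state absorbing.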

Figures (12)

  • Figure 1: (a) Inference paradigm for a standard causal language model versus CaDDi-AR. In CaDDi-AR, each timestep first autoregressively denoises the tokens into $\widetilde{\mathbf{x}}_0$, then re-applies noise via the diffusion kernel to obtain ${\mathbf{x}}_{t-1}$. A traditional autoregressive model emerges as the special case $T=1$ and can be adapted to discrete diffusion by fine-tuning. (b) Extending 1D to 2D rotary positional encoding. Standard rotary encodings for token positions are generalized to also encode diffusion timesteps, remaining fully backward-compatible with existing language model architectures.
  • Figure 2: Inference for Non-Markovian Discrete Diffusion
  • Figure 3: Semi-Speculative Decoding with CaDDi-AR: The model verifies all tokens in parallel to identify the first rejection index $i$, then resumes sampling from that point.
  • Figure 4: Generation performance under manually injected noise at different timesteps
  • Figure 5: Illustration of vanilla block-wise generation of CaDDi. The attention-mask panel shows the mask used in vanilla block-wise generation: the block-level causal mask allows bidirectional attention within each time point and causal attention across time points. The generation-scheme panel shows how generation proceeds. Note that in practice the model itself predicts the clean data $x_0$; the figure highlights the next sampled time point (in color) for clarity.
  • ...and 7 more figures
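
The block-level causal mask described in the Figure 5 caption (bidirectional attention within a time point, causal attention across time points) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the trajectory is flattened into consecutive blocks of equal length, blocks are laid out in generation order, and the helper name `block_causal_mask` is hypothetical rather than the authors' code.

```python
import numpy as np

def block_causal_mask(num_blocks: int, block_len: int) -> np.ndarray:
    """Boolean mask; entry [q, k] is True if query position q may attend to key position k."""
    # Block (time point) index of every position in the flattened trajectory.
    block_id = np.repeat(np.arange(num_blocks), block_len)
    # A query sees a key when the key's block is the same block or an earlier one.
    return block_id[:, None] >= block_id[None, :]

# Hypothetical toy setup: 3 time points, 2 tokens per time point.
mask = block_causal_mask(num_blocks=3, block_len=2)
```

Within a block every position sees every other position in that block, so the diagonal blocks of `mask` are fully True, while blocks above the diagonal are False. An ordinary token-level causal mask is recovered when `block_len=1`.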

Theorems & Definitions (10)

  • Proposition 3.1
  • Proposition 3.2
  • Definition A.1: Markovian absorbing diffusion
  • Definition A.2: Non-Markovian absorbing diffusion
  • Lemma A.3: Markovian suffix information
  • Proof 1 (Direct Expansion)
  • Proof 2 (Conditional Entropy)
  • Lemma A.4: Non-Markovian suffix information
  • Proof
  • Proposition A.5: Equivalence in mutual-information decay