Breaking the Factorization Barrier in Diffusion Language Models

Ian Li; Zilei Shao; Benjie Wang; Rose Yu; Guy Van den Broeck; Anji Liu

TL;DR

Coupled Discrete Diffusion (CoDD) is a hybrid framework that breaks the factorization barrier in diffusion language models by replacing the fully factorized output distribution with a lightweight, tractable probabilistic inference layer. The resulting distribution family is significantly more expressive than standard factorized priors, so it can model complex joint dependencies, yet it remains compact enough to avoid the prohibitive parameter explosion associated with full joint modeling.

Abstract

Diffusion language models theoretically allow for efficient parallel generation but are practically hindered by the "factorization barrier": the assumption that simultaneously predicted tokens are independent. This limitation forces a trade-off: models must either sacrifice speed by resolving dependencies sequentially or suffer from incoherence due to factorization. We argue that this barrier arises not from limited backbone expressivity, but from a structural misspecification: models are restricted to fully factorized outputs because explicitly parameterizing a joint distribution would require the Transformer to output a prohibitively large number of parameters. We propose Coupled Discrete Diffusion (CoDD), a hybrid framework that breaks this barrier by replacing the fully-factorized output distribution with a lightweight, tractable probabilistic inference layer. This formulation yields a distribution family that is significantly more expressive than standard factorized priors, enabling the modeling of complex joint dependencies, yet remains compact enough to avoid the prohibitive parameter explosion associated with full joint modeling. Empirically, CoDD seamlessly enhances diverse diffusion language model architectures with negligible overhead, matching the reasoning performance of computationally intensive Reinforcement Learning baselines at a fraction of the training cost. Furthermore, it prevents performance collapse in few-step generation, enabling high-quality outputs at significantly reduced latencies. Code available at: https://github.com/liuanji/CoDD

Paper Structure (27 sections, 1 theorem, 27 equations, 5 figures, 6 tables, 2 algorithms)

Introduction
Background
Diffusion Language Models
Sampling Algorithms
The Cost of Parallel Prediction
Unlocking Expressive Parallel Generation
Joint Modeling via Product Composition
Tractable Inference with Probabilistic Circuits
Training Objective
Decoding with Coupled Discrete Diffusion
Sampling Strategies
Diffusion Paradigms
Adaptive Activation
Experiments and Results
Models and Tasks
...and 12 more sections

Key Result

Theorem 4.2

For any smooth and decomposable PC $p_{\omega}$ over variables $X$, and any set of independent soft evidence $\mathcal{W}$, the quantity $P(\mathcal{W})$ can be computed exactly in time linear in the size of the circuit.
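To make the linear-time claim concrete, here is a minimal Python sketch; it is not the authors' implementation. It assumes that independent soft evidence assigns each variable $X_i$ a likelihood vector $\lambda_i$, so that $P(\mathcal{W}) = \sum_x p_{\omega}(x) \prod_i \lambda_i(x_i)$. That reading of the theorem, the class names `Leaf`, `Product`, and `Sum`, and the toy circuit `ROOT` are all illustrative assumptions. Under this assumption, a single bottom-up pass that replaces each leaf by its expected likelihood yields the exact value, and the sketch checks it against brute-force enumeration.

```python
import math
from dataclasses import dataclass
from itertools import product as cartesian
from typing import List, Sequence, Union


@dataclass
class Leaf:
    var: int                 # index of the variable this leaf models
    probs: Sequence[float]   # categorical distribution over its values


@dataclass
class Product:
    children: List["Node"]   # decomposable: children cover disjoint variables


@dataclass
class Sum:
    children: List["Node"]   # smooth: children cover the same variables
    weights: Sequence[float]


Node = Union[Leaf, Product, Sum]


def soft_evidence_prob(root: Node, evidence: Sequence[Sequence[float]]) -> float:
    """Bottom-up pass; each node is evaluated once thanks to the cache."""
    cache = {}

    def visit(node: Node) -> float:
        key = id(node)
        if key in cache:
            return cache[key]
        if isinstance(node, Leaf):
            # Expected likelihood of the soft evidence under this leaf.
            val = sum(p * l for p, l in zip(node.probs, evidence[node.var]))
        elif isinstance(node, Product):
            val = math.prod(visit(c) for c in node.children)
        else:
            val = sum(w * visit(c) for w, c in zip(node.weights, node.children))
        cache[key] = val
        return val

    return visit(root)


def density(node: Node, x: Sequence[int]) -> float:
    """Probability the circuit assigns to one complete assignment x."""
    if isinstance(node, Leaf):
        return node.probs[x[node.var]]
    if isinstance(node, Product):
        return math.prod(density(c, x) for c in node.children)
    return sum(w * density(c, x) for w, c in zip(node.weights, node.children))


def brute_force(root: Node, evidence: Sequence[Sequence[float]], num_values: int) -> float:
    """Exponential-time reference: enumerate every assignment explicitly."""
    total = 0.0
    for x in cartesian(range(num_values), repeat=len(evidence)):
        weight = math.prod(evidence[i][xi] for i, xi in enumerate(x))
        total += density(root, x) * weight
    return total


# Hypothetical toy circuit: a two-component mixture over two binary variables.
ROOT = Sum(
    children=[
        Product([Leaf(0, [0.9, 0.1]), Leaf(1, [0.8, 0.2])]),
        Product([Leaf(0, [0.1, 0.9]), Leaf(1, [0.3, 0.7])]),
    ],
    weights=[0.6, 0.4],
)
EVIDENCE = [[0.5, 1.0], [0.2, 0.7]]  # one likelihood vector per variable

if __name__ == "__main__":
    print(soft_evidence_prob(ROOT, EVIDENCE), brute_force(ROOT, EVIDENCE, 2))
```

The bottom-up pass touches every node and edge once, whereas the brute-force reference grows exponentially with the number of variables. The two agree here because sums distribute over the per-variable likelihoods when the circuit is smooth and decomposable.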

Figures (5)

Figure 1: Motivation and Intuition of CoDD. Left: Illustration of the misspecification gap. The plot reports the perplexity of LLaDA (Nie et al., 2025) on the MathInstruct validation set across varying mask ratios. Curve (a) Sequential generation represents the ideal baseline (i.e., the true joint distribution learned by the model). When restricted to (b) One-step generation, the independence assumption causes significant performance degradation. The shaded region highlights this loss of perplexity, defined as the misspecification gap $\mathcal{L}_{\mathrm{gap}}$. (c) CoDD significantly bridges this gap while retaining the efficiency of one-step prediction. Right: Conceptual comparison on "He is from <MASK><MASK>". (a) Sequential generation accurately resolves dependencies but sacrifices speed. (b) One-step generation predicts in parallel but assumes independence, leading to incoherent mixtures like "San York". (c) CoDD overcomes this by modulating predictions with a tractable probabilistic inference layer, recovering valid joint distributions (e.g., "San Diego") in a single parallel step.
Figure 2: Conditional Likelihood on Ground Truth. Conditional log-likelihood (CLL) of the CoDD and Dream models, evaluated directly on ground-truth question-answer pairs from the full MathInstruct dataset.
Figure 3: Left: Performance Comparison on MATH500 vs. RL Baselines for 256/128/64 diffusion steps with a fixed generation length of 512. d-GRPO denotes diffu-GRPO and is reproduced with the codebase of Zhao et al. (2025). Methods marked with superscripts (d1$^{\dagger}$, wd1$^{\ddagger}$, d2$^{\diamond}$) are reported from prior work (Zhao et al., 2025; Tang et al., 2025; Wang et al., 2025). Right: Training-time cost in GPU hours ($\downarrow$ is better) for diffu-GRPO and CoDD under our implementation.
Figure 4: Qualitative comparison: Exponent Equation (64 steps). At reduced sampling budgets, the baseline model suffers from severe mode collapse (repeating tokens). CoDD effectively steers the generation to recover coherent reasoning.
Figure 5: Algebraic simplification at low compute (64 steps). The baseline model correctly derives the intermediate terms but fails at the final combination step, effectively "forgetting" the remaining terms ($3x-6$). CoDD maintains coherence through the multi-step reasoning to reach the correct solution.
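
The "San York" failure described in the Figure 1 caption can be reproduced with a toy calculation. The Python sketch below is an illustration under assumed numbers, not code from the paper. It takes a hypothetical two-token joint distribution `JOINT` with equal mass on "San Diego" and "New York", builds the product of its per-position marginals (the fully factorized prediction), and measures how much probability lands on pairs the joint never produces. The KL divergence computed at the end is one common way to quantify the mismatch in this toy setting; it is not claimed to be the paper's definition of $\mathcal{L}_{\mathrm{gap}}$.

```python
import math
from collections import defaultdict

# Hypothetical joint over two masked positions (illustrative numbers only).
JOINT = {("San", "Diego"): 0.5, ("New", "York"): 0.5}


def marginals(joint):
    """Per-position marginal distributions of a two-token joint."""
    first, second = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        first[a] += p
        second[b] += p
    return dict(first), dict(second)


def factorized(joint):
    """Product of marginals: what an independent parallel prediction samples from."""
    first, second = marginals(joint)
    return {(a, b): pa * pb for a, pa in first.items() for b, pb in second.items()}


def invalid_mass(joint, approx):
    """Probability the approximation places on pairs with zero joint probability."""
    return sum(p for pair, p in approx.items() if joint.get(pair, 0.0) == 0.0)


def kl_divergence(joint, approx):
    """KL(joint || approx) in nats, summed over the support of the joint."""
    return sum(p * math.log(p / approx[pair]) for pair, p in joint.items() if p > 0)


if __name__ == "__main__":
    approx = factorized(JOINT)
    print(approx[("San", "York")])       # mass on an incoherent pair
    print(invalid_mass(JOINT, approx))   # total mass on incoherent pairs
    print(kl_divergence(JOINT, approx))  # mismatch between joint and product
```

In this toy case both marginals are uniform, so the factorized prediction cannot distinguish coherent from incoherent pairs, even though each per-position marginal is exactly right.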

Theorems & Definitions (4)

Definition 4.1: Probabilistic Circuits
Definition 3.1: Decomposability
Definition 4.1: Independent Virtual-Evidence
Theorem 4.2
