Planner Aware Path Learning in Diffusion Language Models Training

Fred Zhangzhi Peng, Zachary Bezemek, Jarrid Rector-Brooks, Shuibai Zhang, Anru R. Zhang, Michael Bronstein, Avishek Joey Bose, Alexander Tong

TL;DR

This paper theoretically proves that the standard discrete diffusion training evidence lower bound (ELBO) does not accurately describe a denoiser that uses a non-uniform planner and derives a new planned evidence lower bound (P-ELBO) that incorporates planner-based reverse dynamics directly into the training objective.

Abstract

Diffusion language models have emerged as a powerful alternative to autoregressive models, enabling fast inference through more flexible and parallel generation paths. This flexibility of sampling is unlocked by newly engineered sampling strategies, or planners, that select more favorable generation paths by iteratively planning where to denoise along the sequence, rather than choosing positions uniformly at random. However, by modifying the reverse paths via planning, planners create an irrevocable mismatch between the uniformly random denoising paths assumed during training and planning-based inference. In this paper, we systematically investigate the mismatch of discrete diffusion training and inference under planning and theoretically prove that the standard discrete diffusion training evidence lower bound (ELBO) does not accurately describe a denoiser that uses a non-uniform planner. To address this gap, we derive a new planned evidence lower bound (P-ELBO) that incorporates planner-based reverse dynamics directly into the training objective. Using the P-ELBO, we introduce Planner Aware Path Learning (PAPL), a novel training scheme that aligns training and inference under a planned denoiser. PAPL is implemented as a simple yet effective modification to the standard masked discrete diffusion loss, making it widely applicable and easy to adopt. Empirically, we show PAPL delivers consistent gains across domains, including a 40% relative improvement in protein sequences, improved text generation with up to a 4x relative MAUVE gain, and a 23% relative improvement in HumanEval pass@10 for code generation. Code is available at https://github.com/pengzhangzhi/PAPL.
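
To make the distinction between uniform and planned denoising paths concrete, here is a minimal sketch of one position-selection step. It is an illustration only, not the authors' implementation: the function names, the toy logits, and the use of top-token probability as the planner's confidence score are all assumptions.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def uniform_planner(masked, rng):
    """Pick the next position to denoise uniformly at random."""
    return rng.choice(masked)

def greedy_planner(masked, logits):
    """Pick the masked position whose most likely token has the highest
    probability (a hypothetical confidence-based planner)."""
    return max(masked, key=lambda i: max(softmax(logits[i])))

# Toy denoiser output: per-position logits over a 3-token vocabulary
# (made-up numbers for illustration).
logits = {
    0: [0.1, 0.0, 0.2],  # nearly flat: low confidence
    1: [4.0, 0.0, 0.0],  # peaked: high confidence
    2: [1.0, 0.5, 0.0],  # in between
}
masked = [0, 1, 2]

greedy_choice = greedy_planner(masked, logits)
uniform_choice = uniform_planner(masked, random.Random(0))
```

The uniform planner ignores the denoiser's output, whereas the greedy planner always unmasks the most confident position first, so the two induce different distributions over generation paths. That difference is the training-inference mismatch the abstract describes.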

Paper Structure

This paper contains 63 sections, 18 theorems, 126 equations, 12 figures, 7 tables, and 4 algorithms.

Key Result

Proposition 3.0

For $p^\text{greedy}_\theta(\mathbf{x}_0)$ defined with $G_\phi$ in equation eq:greedyancestralG and $D_\theta$ an imperfect denoiser, we may have […], where $\mathcal{E}^{\theta,\text{unif}}(\mathbf{x}_0)$ is as in equation eq:AOARMELBO.

Figures (12)

  • Figure 1: Planner-Aware Path Learning (PAPL) resolves training–inference mismatch in DLMs. Standard uniform training for DLMs (left) applies a uniform loss across all masked positions, distributing capacity over regions that inference-time planners never traverse. PAPL (right) introduces planner-aware weights into the loss, aligning training with the planner’s preferred trajectories (outlined arrows) and eliminating training-inference mismatch.
  • Figure 2: Visualization of PAPL generated proteins folded with ESMFold.
  • Figure 3: PAPL consistently improves over DLM across training, sampling steps, and temperature. (a) Faster convergence in training steps. (b) Higher performance across sampling steps. (c) More robust to temperature when training from scratch. (d) More robust to temperature when fine-tuning.
  • Figure 4: Effect of $\tau$ and $\alpha$ on foldability. Lower $\tau$ ($<1$) improves performance. Increasing $\alpha$ steadily boosts foldability up to $\alpha=5$. The dashed line denotes the vanilla DLM baseline.
  • Figure 5: Training with pure PAPL loss ($\tau=1$) leads to unstable behavior, with large fluctuations in training (left) and poor convergence on validation (right).
  • ...and 7 more figures
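
The Figure 1 caption contrasts a uniform loss over masked positions with a loss that carries planner-aware weights. The sketch below shows one way such a reweighting could look. It is a hedged illustration under stated assumptions, not the paper's loss: the weighting rule `1 + strength * softmax(confidence / temperature)` and the parameter names `strength` and `temperature` are hypothetical and are not claimed to correspond to the $\alpha$ and $\tau$ in Figures 4 and 5.

```python
import math

def softmax(values, temperature=1.0):
    """Numerically stable softmax of values / temperature."""
    scaled = [v / temperature for v in values]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def planner_weights(confidences, strength, temperature):
    """Hypothetical rule: every masked position keeps a base weight of 1,
    plus extra weight proportional to the planner's preference for it."""
    probs = softmax(confidences, temperature)
    return [1.0 + strength * p for p in probs]

def weighted_masked_loss(true_token_probs, weights):
    """Weighted mean negative log-likelihood over masked positions."""
    losses = [-math.log(p) for p in true_token_probs]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)

# Toy values: probability the denoiser assigns to the correct token at
# each masked position (made-up numbers for illustration).
true_token_probs = [0.9, 0.5, 0.2]
# Assumption: the planner's confidence is the log-probability of that token.
confidences = [math.log(p) for p in true_token_probs]

uniform_loss = weighted_masked_loss(true_token_probs, [1.0, 1.0, 1.0])
weights = planner_weights(confidences, strength=2.0, temperature=0.5)
planned_loss = weighted_masked_loss(true_token_probs, weights)
no_planner_loss = weighted_masked_loss(
    true_token_probs, planner_weights(confidences, strength=0.0, temperature=0.5)
)
```

With `strength=0.0` the weights are all 1 and the loss reduces to the uniform masked loss, so in this sketch the planner term acts as an additive correction on top of the standard objective.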

Theorems & Definitions (32)

  • Proposition 3.0
  • Proposition 3.0
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • Corollary A.3
  • proof
  • Corollary A.4
  • proof
  • ...and 22 more