Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling

Meihua Dang, Jiaqi Han, Minkai Xu, Kai Xu, Akash Srivastava, Stefano Ermon

TL;DR

This work addresses steering discrete diffusion language models toward task-specific rewards without retraining. It introduces PG-DLM, a trajectory-level inference-time method that iteratively refines full generations via a conditional sequential Monte Carlo kernel, yielding convergence guarantees and improved reward optimization while preserving generation quality. The authors develop a unified framework for multi-axis inference-time scaling and show that, under fixed compute budgets, increasing the number of iterations $m$ often yields the best reward-perplexity trade-off, with strong empirical gains over baselines on MDLM and LLaDA-8B across toxicity, sentiment, and linguistic-acceptability tasks. The approach demonstrates broad compatibility with various diffusion processes and provides practical guidance for compute allocation in real-world constrained settings, enabling scalable, controllable generation without retraining.

Abstract

Discrete diffusion models have recently emerged as strong alternatives to autoregressive language models, matching their performance through large-scale training. However, inference-time control remains relatively underexplored. In this work, we study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity under reward optimization. PG-DLM constructs a Markov chain over full denoising trajectories and applies a conditional sequential Monte Carlo kernel to resample them. We derive theoretical guarantees for convergence, including asymptotic consistency and variance bounds. Within this framework, we further analyze trade-offs across four key axes for inference-time scaling under fixed budgets: iterations, samples, denoising steps, and reward estimation. Our analysis shows scaling iterations achieves the best reward-perplexity trade-off. Empirically, PG-DLM consistently outperforms prior methods using MDLM and LLaDA-8B as base models across a wide range of compute budgets for reward-guided generation tasks including toxicity and sentiment control as well as linguistic acceptability.
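
The trajectory-level refinement described above can be illustrated with a textbook particle Gibbs loop on a toy problem. This is a minimal sketch under stated assumptions, not the authors' implementation: the "base model" is a uniform sampler over binary tokens, a token list built step by step stands in for a denoising trajectory, and `reward`, `conditional_smc`, `particle_gibbs`, and the temperature `beta` are hypothetical names introduced only for illustration. The names `m` (iterations) and `k` (samples) follow the paper's usage.

```python
import math
import random


def reward(tokens):
    """Hypothetical toy reward: the number of 1-tokens generated so far."""
    return float(sum(tokens))


def conditional_smc(reference, k, beta, rng):
    """One conditional SMC sweep with particle 0 pinned to `reference`."""
    particles = [[] for _ in range(k)]
    prev_r = [0.0] * k          # reward of each particle before the current step
    weights = [1.0] * k
    steps = len(reference)
    for t in range(steps):
        for i in range(k):
            if i == 0:
                # The reference trajectory is kept fixed (never re-proposed).
                particles[i] = list(reference[: t + 1])
            else:
                # Assumption: the base model proposes a uniform random token.
                particles[i] = particles[i] + [rng.randint(0, 1)]
        # Incremental weight: exponentiated reward gain from this step.
        weights = [math.exp((reward(p) - r) / beta)
                   for p, r in zip(particles, prev_r)]
        if t < steps - 1:
            # Resample slots 1..k-1 from all k particles; slot 0 stays pinned.
            idx = rng.choices(range(k), weights=weights, k=k - 1)
            particles = [particles[0]] + [list(particles[j]) for j in idx]
            prev_r = [reward(p) for p in particles]
    # Draw the returned trajectory in proportion to the final weights.
    chosen = rng.choices(range(k), weights=weights, k=1)[0]
    return particles[chosen]


def particle_gibbs(m, k, steps, beta, seed=0):
    """Run m conditional SMC sweeps, feeding each output back as the reference."""
    rng = random.Random(seed)
    reference = [rng.randint(0, 1) for _ in range(steps)]
    for _ in range(m):
        reference = conditional_smc(reference, k, beta, rng)
    return reference


if __name__ == "__main__":
    sample = particle_gibbs(m=4, k=8, steps=12, beta=0.1)
    print(sample, reward(sample))
```

The sketch draws the next reference in proportion to the final weights, as in standard particle Gibbs; the paper's Figure 1 caption describes the highest-reward trajectory becoming the next reference, which would replace the final weighted draw with an argmax.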

Paper Structure

This paper contains 31 sections, 2 theorems, 21 equations, 8 figures, 8 tables, 2 algorithms.

Key Result

Theorem 1

Assume that: (1) the diffusion model $p_\theta({\mathbf{x}}_0 \:\vert\:{\mathbf{c}})$ provides accurate posterior mean estimation, i.e., samples from $p_\theta({\mathbf{x}}_0 \:\vert\:{\mathbf{c}}, {\mathbf{x}}_t)$ are unbiased for the true posterior mean as the noise level approaches zero; and (2)

Figures (8)

  • Figure 1: Illustration of PG-DLM. At each iteration, a reference trajectory is fixed (top row) while new trajectories are generated and resampled (gray). The highest-reward one becomes the next reference (colored), enabling iterative refinement. The final outputs are selected after multiple iterations.
  • Figure 1: Accuracy at high NFE.
  • Figure 2: Trade-off between particle Gibbs iterations $m$ and sample counts $k$ across compute budgets (NFEs). The x-axis shows NFEs controlled by varying $k$, and the legend shows $m$. Increasing $k$ (with $m\!=\!1$) performs best in low-NFE regimes. However, as samples saturate, additional iterations ($m\!=\!2,4$) become more effective.
  • Figure 3: Toxicity accuracy (blue) and perplexity (gray) as compute budgets increase, by varying iterations $m$ (left) and samples $k$ (right).
  • Figure 4: Trade-offs between sample counts $k$ and denoising steps $T$ across compute budgets (NFEs). For (a) LLaDA, the x-axis shows NFEs controlled by varying $k$, with $T$ in the legend; for (b-d) MDLM, the x-axis shows NFEs controlled by varying $T$, with $k$ in the legend. Scaling $k$ (and decreasing $T$ accordingly) generally yields better performance under the same NFEs.
  • ...and 3 more figures
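
The trade-offs in the captions above compare settings of $m$, $k$, and $T$ at a matched number of function evaluations (NFEs). As a rough aid for reading them, the sketch below enumerates configurations that hit a given budget. The cost model NFE = m * k * T is an assumption made here for illustration: it ignores the reward-estimation axis and may differ from the paper's own accounting. The function name `allocations` and the candidate grids are likewise hypothetical.

```python
def allocations(budget, ms, ks, steps_options):
    """Enumerate (m, k, T) triples that match `budget` exactly.

    Assumed cost model (not taken from the paper): NFE = m * k * T,
    i.e. iterations x samples x denoising steps.
    """
    return [
        (m, k, t)
        for m in ms
        for k in ks
        for t in steps_options
        if m * k * t == budget
    ]


if __name__ == "__main__":
    # Hypothetical candidate grids for each axis.
    for m, k, t in allocations(1024, ms=[1, 2, 4], ks=[2, 4, 8, 16, 32],
                               steps_options=[32, 64, 128, 256]):
        print(f"m={m:<2} k={k:<3} T={t}")
```

Each printed row is one way to spend the same budget, which is the kind of matched-compute comparison the figures report.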

Theorems & Definitions (2)

  • Theorem 1: Asymptotic Consistency
  • Theorem 2: Variance Bound