Effective Test-Time Scaling of Discrete Diffusion through Iterative Refinement

Sanghyun Lee, Sunwoo Kim, Seungryong Kim, Jongho Park, Dongmin Park

TL;DR

This work tackles the challenge of improving discrete diffusion models at test time through IterRef, a reward-guided iterative refinement framework. By formulating refinement as a Multiple-Try Metropolis process with a noising–denoising transition, it guarantees convergence toward a reward-aligned distribution while enabling selective, computation-aware application along the denoising trajectory. Empirically, IterRef delivers consistent performance gains across language and image tasks, particularly under low compute budgets, and demonstrates robust detoxification capabilities in safety-alignment scenarios. The approach advances practical test-time scaling for discrete diffusion, supported by theoretical convergence and comprehensive empirical validation across multiple backbones and rewards.

Abstract

Test-time scaling through reward-guided generation remains largely unexplored for discrete diffusion models despite its potential as a promising alternative. In this work, we introduce Iterative Reward-Guided Refinement (IterRef), a novel test-time scaling method tailored to discrete diffusion that leverages reward-guided noising-denoising transitions to progressively refine misaligned intermediate states. We formalize this process within a Multiple-Try Metropolis (MTM) framework, proving convergence to the reward-aligned distribution. Unlike prior methods that assume the current state is already aligned with the reward distribution and only guide the subsequent transition, our approach explicitly refines each state in situ, progressively steering it toward the optimal intermediate distribution. Across both text and image domains, we evaluate IterRef on diverse discrete diffusion models and observe consistent improvements in reward-guided generation quality. In particular, IterRef achieves striking gains under low compute budgets, far surpassing prior state-of-the-art baselines.
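
The refinement loop described above can be illustrated with a toy Multiple-Try Metropolis (MTM) sampler. This is a minimal sketch under stated assumptions, not the authors' implementation. The binary vocabulary, the sequence length, the `reward` function, the temperature `ALPHA`, and the uniform refill that stands in for a pretrained denoiser are all hypothetical choices made for illustration. Because the toy noising–denoising kernel is symmetric, taking the balancing function as the reciprocal of the proposal probability reduces each MTM weight to the unnormalised target value.

```python
import math
import random
from collections import Counter

VOCAB = (0, 1)   # toy vocabulary (assumption)
LENGTH = 3       # toy sequence length (assumption)
ALPHA = 1.0      # temperature in exp(reward / ALPHA) (assumption)


def reward(x):
    """Hypothetical reward: the number of 1-tokens in the sequence."""
    return sum(x)


def target_weight(x):
    """Unnormalised target: uniform prior times exp(reward / ALPHA)."""
    return math.exp(reward(x) / ALPHA)


def noise_denoise(x, rng, remask_prob=0.5):
    """Remask each position independently, then refill it uniformly.

    The uniform refill stands in for a learned denoiser. The kernel is
    symmetric, T(x, y) == T(y, x), because its probability depends only
    on which positions differ between x and y.
    """
    return tuple(
        rng.choice(VOCAB) if rng.random() < remask_prob else tok
        for tok in x
    )


def mtm_step(x, n_candidates, rng):
    """One Multiple-Try Metropolis step with a symmetric proposal."""
    # 1) Propose N candidates from the current state and weight them.
    candidates = [noise_denoise(x, rng) for _ in range(n_candidates)]
    cand_w = [target_weight(y) for y in candidates]
    # 2) Select one candidate with probability proportional to its weight.
    y = rng.choices(candidates, weights=cand_w, k=1)[0]
    # 3) Draw N-1 reference points from y and append the current state x.
    refs = [noise_denoise(y, rng) for _ in range(n_candidates - 1)] + [x]
    ref_w = [target_weight(z) for z in refs]
    # 4) Generalised Metropolis acceptance test.
    accept = min(1.0, sum(cand_w) / sum(ref_w))
    return y if rng.random() < accept else x


def run_chain(n_steps, n_candidates=4, seed=0):
    """Run the chain from a low-reward start and count visited states."""
    rng = random.Random(seed)
    x = tuple(0 for _ in range(LENGTH))
    visits = Counter()
    for _ in range(n_steps):
        x = mtm_step(x, n_candidates, rng)
        visits[x] += 1
    return visits
```

In this sketch, `n_candidates` plays the role of the candidate count $N$ and `n_steps` the role of the iteration count $k$. Calling `run_chain(20000)` and normalising the counts gives empirical state frequencies that can be compared with `target_weight` divided by its sum over all eight states.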

Paper Structure

This paper contains 62 sections, 1 theorem, 14 equations, 8 figures, 3 tables, 5 algorithms.

Key Result

Proposition 1

Let $x_t$ be a sample drawn from a distribution that is not reward-aligned. By applying MTM with the transition kernel $K$ and balancing function $\lambda$ defined above, the resulting Markov chain satisfies the detailed balance condition. Moreover, as the number of iterations $n\!\to\!\infty$, the

Figures (8)

  • Figure 1: Overview of IterRef. (a) Reward-guided denoising trajectories: blue nodes are selected samples, gray nodes are rejected candidates. Unlike existing single-step guidance methods (IS and SMC), IterRef discovers higher-reward samples by iteratively applying noising–denoising kernels. The noising process (dotted nodes) with random remasking incurs negligible cost while opening broader regions to explore and letting tokens be corrected. (b) Scaling performance. IterRef scales significantly faster (up to 8×) than baselines with a safety reward on LLaDA-8B (see the safety section for details).
  • Figure 1: Quantitative Results with MaskGIT. We compare IterRef with baselines under varying computational costs, guided by CLIPScore. IterRef performs the best across all settings.
  • Figure 2: Performance comparison of IterRef with baselines on four guided generation tasks (CoLA, Toxicity, Sentiment, and Perplexity) under varying inference costs (NFEs) with two discrete diffusion backbones (MDLM and LLaDA).
  • Figure 3: Qualitative results on MaskGIT: samples generated by baselines and IterRef.
  • Figure 4: Scaling effects of MDLM with $N$ and $k$. The figure illustrates the trade-off between iteration count $k$ and candidates $N$. Increasing $k$ consistently yields greater performance gains than increasing $N$, demonstrating the efficacy of iteration.
  • ...and 3 more figures

Theorems & Definitions (3)

  • Remark 1: Arises naturally from the proof of Theorem 1 in Uehara et al. (2024)
  • Proposition 1: Convergence of MTM to the Optimal Distribution
  • proof