Effective Test-Time Scaling of Discrete Diffusion through Iterative Refinement
Sanghyun Lee, Sunwoo Kim, Seungryong Kim, Jongho Park, Dongmin Park
TL;DR
This work tackles the challenge of improving discrete diffusion models at test time through IterRef, a reward-guided iterative refinement framework. By formulating refinement as a Multiple-Try Metropolis process with a noising–denoising transition, it guarantees convergence toward a reward-aligned distribution while enabling selective, computation-aware application along the denoising trajectory. Empirically, IterRef delivers consistent performance gains across language and image tasks, particularly under low compute budgets, and demonstrates robust detoxification capabilities in safety-alignment scenarios. The approach advances practical test-time scaling for discrete diffusion, supported by theoretical convergence and comprehensive empirical validation across multiple backbones and rewards.
Abstract
Test-time scaling through reward-guided generation remains largely unexplored for discrete diffusion models, despite its promise as an alternative. In this work, we introduce Iterative Reward-Guided Refinement (IterRef), a novel test-time scaling method tailored to discrete diffusion that leverages reward-guided noising–denoising transitions to progressively refine misaligned intermediate states. We formalize this process within a Multiple-Try Metropolis (MTM) framework and prove convergence to the reward-aligned distribution. Unlike prior methods, which assume the current state is already aligned with the reward distribution and guide only the subsequent transition, our approach explicitly refines each state in situ, progressively steering it toward the optimal intermediate distribution. Across both text and image domains, we evaluate IterRef on diverse discrete diffusion models and observe consistent improvements in reward-guided generation quality. In particular, IterRef achieves striking gains under low compute budgets, far surpassing prior state-of-the-art baselines.
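To make the Multiple-Try Metropolis idea concrete, the following is a minimal toy sketch of one generic MTM refinement loop over token sequences. It is not the IterRef algorithm itself. As labeled assumptions, the proposal is a symmetric single-token change rather than a diffusion noising–denoising transition, the prior over sequences is uniform, and the target is taken to be proportional to exp(reward / beta). All names (`propose`, `weight`, `mtm_step`, `refine`, `beta`, `k`) are hypothetical and chosen for illustration only.

```python
import math
import random


def propose(x, vocab_size, rng):
    """Symmetric toy proposal: change one uniformly chosen position to a different token."""
    i = rng.randrange(len(x))
    tok = rng.randrange(vocab_size - 1)
    if tok >= x[i]:
        tok += 1  # skip the current token so the state always changes
    return x[:i] + (tok,) + x[i + 1:]


def weight(x, reward, beta):
    """Unnormalized toy target: uniform prior times exp(reward / beta) (an assumption)."""
    return math.exp(reward(x) / beta)


def mtm_step(x, reward, beta, k, vocab_size, rng):
    """One Multiple-Try Metropolis step with a symmetric proposal."""
    # 1. Draw k candidate states and weight each by the target.
    trials = [propose(x, vocab_size, rng) for _ in range(k)]
    w_trials = [weight(y, reward, beta) for y in trials]
    # 2. Select one candidate with probability proportional to its weight.
    y = rng.choices(trials, weights=w_trials, k=1)[0]
    # 3. Build the reference set: k - 1 draws from y, plus the current state.
    refs = [propose(y, vocab_size, rng) for _ in range(k - 1)] + [x]
    w_refs = [weight(z, reward, beta) for z in refs]
    # 4. Accept the selected candidate with the generalized Metropolis ratio.
    accept = min(1.0, sum(w_trials) / sum(w_refs))
    return y if rng.random() < accept else x


def refine(x0, reward, beta=0.5, k=4, vocab_size=2, steps=200, seed=0):
    """Repeatedly apply MTM steps to refine a state toward the reward-tilted target."""
    rng = random.Random(seed)
    x = tuple(x0)
    for _ in range(steps):
        x = mtm_step(x, reward, beta, k, vocab_size, rng)
    return x


if __name__ == "__main__":
    # Toy reward: number of ones in a binary sequence.
    count_ones = lambda x: float(sum(x))
    print(refine((0,) * 8, count_ones))
```

In this toy setting, a smaller `beta` concentrates the target on high-reward sequences, and a larger `k` spends more reward evaluations per step, which loosely mirrors the compute versus quality trade-off of test-time scaling. The plain `math.exp` weighting is adequate for small rewards; a log-space computation would be safer for large ones.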
