Reward-Guided Discrete Diffusion via Clean-Sample Markov Chain for Molecule and Biological Sequence Design

Prin Phunyaphibarn; Minhyuk Sung

TL;DR

This paper tackles reward-guided sampling in discrete diffusion models for molecules and biological sequences, where reward functions are often non-smooth and rewards evaluated on intermediate (noisy) samples are unreliable. It introduces the Clean-Sample Markov Chain (CSMC) Sampler, a training-free, Metropolis-Hastings-based method that operates directly on clean samples. Its proposal applies the forward and then the backward diffusion process, which makes the acceptance probability tractable and enables local search without intermediate rewards. Experiments on the QM9, ZINC250K, and MPRA datasets report that CSMC yields the highest rewards across multiple base diffusion architectures (MDM, USM, SEDD variants) and reward functions, with the CSMC-B variant offering faster sampling while maintaining performance. The work shows that guidance from rewards evaluated on clean samples can outperform methods relying on intermediate rewards, and it provides a framework applicable to both uniform and masked discrete diffusion models, with potential for broader use in generative design for scientific applications.

Abstract

Discrete diffusion models have recently emerged as a powerful class of generative models for chemistry and biology data. In these fields, the goal is to generate various samples with high rewards (e.g., drug-likeness in molecules), making reward-based guidance crucial. Most existing methods are based on guiding the diffusion model using intermediate rewards but tend to underperform since intermediate rewards are noisy due to the non-smooth nature of reward functions used in scientific domains. To address this, we propose Clean-Sample Markov Chain (CSMC) Sampler, a method that performs effective test-time reward-guided sampling for discrete diffusion models, enabling local search without relying on intermediate rewards. CSMC constructs a Markov chain of clean samples using the Metropolis-Hastings algorithm such that its stationary distribution is the target distribution. We design a proposal distribution by sequentially applying the forward and backward diffusion processes, making the acceptance probability tractable. Experiments on molecule and biological sequence generation with various reward functions demonstrate that our method consistently outperforms prior approaches that rely on intermediate rewards.
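The idea of a Markov chain over clean samples can be illustrated with a minimal toy sketch. It is not the authors' implementation: the alphabet `VOCAB`, the reward `toy_reward`, the temperature `alpha`, and the uniform "pretrained" prior are all assumptions made for illustration. Under a uniform prior, corrupting a few positions (forward step) and resampling them uniformly (backward step) gives a symmetric proposal, so the Metropolis-Hastings acceptance probability reduces to a ratio of exponentiated rewards evaluated on clean sequences only.

```python
import math
import random

VOCAB = "ACGT"  # hypothetical toy alphabet, not from the paper


def toy_reward(seq):
    # Hypothetical reward: fraction of G/C tokens in the sequence.
    return sum(ch in "GC" for ch in seq) / len(seq)


def forward_backward_proposal(seq, num_masked, rng):
    # Forward step: pick positions to corrupt.
    # Backward step: resample them from the (assumed uniform) prior.
    positions = rng.sample(range(len(seq)), num_masked)
    proposal = list(seq)
    for i in positions:
        proposal[i] = rng.choice(VOCAB)
    return "".join(proposal)


def clean_sample_chain(init, steps, alpha=0.1, num_masked=2, seed=0):
    # Metropolis-Hastings chain targeting p(x) proportional to exp(reward(x) / alpha)
    # under a uniform prior; every state in the chain is a clean sequence.
    rng = random.Random(seed)
    current = init
    current_reward = toy_reward(current)
    trajectory = [current]
    for _ in range(steps):
        candidate = forward_backward_proposal(current, num_masked, rng)
        candidate_reward = toy_reward(candidate)
        # The proposal is symmetric here, so only the reward difference remains.
        log_accept = (candidate_reward - current_reward) / alpha
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            current, current_reward = candidate, candidate_reward
        trajectory.append(current)
    return trajectory


chain = clean_sample_chain("AAAAAAAA", steps=500)
print(chain[-1], toy_reward(chain[-1]))
```

With a learned denoiser in place of the uniform resampling step, the proposal is generally no longer symmetric, and the acceptance ratio would need the corresponding proposal terms.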

Paper Structure (40 sections, 1 theorem, 19 equations, 10 figures, 11 tables, 2 algorithms)

This paper contains 40 sections, 1 theorem, 19 equations, 10 figures, 11 tables, 2 algorithms.

Introduction
Related Work
Training-Free Reward-Guided Sampling for Discrete Diffusion.
Training-Based Reward-Guided Sampling for Discrete Diffusion.
Background
Discrete and Continuous-Time Discrete Diffusion
Discrete-Time Discrete Diffusion.
Continuous-Time Discrete Diffusion.
Masked and Uniform Discrete Diffusion
Masked Diffusion Models (MDMs).
Uniform State Models (USMs).
Clean-Sample Markov Chain Sampler
Bypassing the Intermediate Rewards
Forward-Backward Proposal Distribution
Experiments
...and 25 more sections

Key Result

Theorem 1.1

If the transition matrix defined by the MH algorithm (Eq. eq:mh-transition) is ergodic and irreducible, then $p^\star$ is its unique limiting distribution.
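The statement can be checked numerically on a small example. The sketch below builds a Metropolis-Hastings transition matrix for a hypothetical three-state target and a uniform proposal (both chosen for illustration, not taken from the paper), then verifies that the target is stationary and that an arbitrary starting distribution converges to it.

```python
import math


def mh_transition_matrix(target, proposal):
    # T[i][j] = q(j | i) * min(1, p(j) q(i | j) / (p(i) q(j | i))) for i != j;
    # the diagonal holds the remaining (rejection plus self-proposal) mass.
    n = len(target)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and proposal[i][j] > 0:
                ratio = (target[j] * proposal[j][i]) / (target[i] * proposal[i][j])
                T[i][j] = proposal[i][j] * min(1.0, ratio)
        T[i][i] = 1.0 - sum(T[i])
    return T


def step_distribution(dist, T):
    # One step of the chain applied to a row vector: dist @ T.
    n = len(dist)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]


# Hypothetical target: p(x) proportional to exp(reward(x)) on three states.
rewards = [0.0, 1.0, 3.0]
weights = [math.exp(r) for r in rewards]
target = [w / sum(weights) for w in weights]
proposal = [[1.0 / 3.0] * 3 for _ in range(3)]  # uniform proposal

T = mh_transition_matrix(target, proposal)

# Start from a point mass and iterate the chain.
dist = [1.0, 0.0, 0.0]
for _ in range(200):
    dist = step_distribution(dist, T)

print("target:", target)
print("after 200 steps:", dist)
```

The two printed vectors should agree to numerical precision, since this toy chain is irreducible and aperiodic.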

Figures (10)

Figure 1: Left: In scientific applications, rewards defined on discrete spaces are highly sensitive to small perturbations. A one-character change to a SMILES string can result in an invalid string with zero reward. Properties such as QED, ring count, and synthetic accessibility (SA) can also vary significantly when only one or two tokens change. Right: Reverse diffusion and typical inference-time scaling methods [kimtestsmc, li2024svdd], such as SMC [kimtestsmc], guide samples through noise levels by constructing a Markov chain that begins with pure noise and ends with clean samples. Our CSMC instead constructs a Markov chain consisting only of clean samples by applying the forward and then the reverse process at each step. This formulation bypasses the need for intermediate rewards by evaluating the reward directly on clean samples while leveraging information from past samples for guidance.
Figure 2: Reward distributions of the pretrained USM. The red dotted line represents the average reward achieved by CSMC. For ring count and HepG2, the pretrained model reward distribution has low density at higher rewards, resulting in degraded performance for BoN sampling.
Figure 3: Autocorrelation plots for ZINC250K MDM and USM. The autocorrelation function quickly vanishes to zero within the first 2000 iterations, indicating fast mixing.
Figure 4: Reward trajectories with different initializations. We plot the reward trajectories of CSMC for 64 different initializations using MDM on QM9 QED (bottom), along with the reward distribution of the 64 different chains (top). Trajectories with lower reward also quickly converge to high reward regions of the distribution, demonstrating robustness to the initialization.
Figure 5: Autocorrelation plots for ZINC250K MDM and USM.
...and 5 more figures

Theorems & Definitions (2)

Theorem 1.1: Theorem 12.2.1 from murphy2023probabilistic
proof
