Discrete Diffusion Trajectory Alignment via Stepwise Decomposition

Jiaqi Han, Austin Wang, Minkai Xu, Wenda Chu, Meihua Dang, Haotian Ye, Huayu Chen, Yisong Yue, Stefano Ermon

TL;DR

This work aligns discrete diffusion models with reward signals through an offline, stepwise trajectory optimization method (SDPO). It decomposes the trajectory-level objective into per-step alignments of the posterior $\hat{p}_\theta({\mathbf{x}}_0|{\mathbf{x}}_t,{\mathbf{c}})$, and shows that the stepwise optima recover the optimum of the full trajectory objective when the chain reward $\hat{r}({\mathbf{x}}_{0:T},{\mathbf{c}})$ factorizes additively over steps. A distribution-matching formulation between $\tilde{p}_r$ and $\tilde{p}_\theta$ allows off-policy learning with arbitrary reward models, and comes with a tractable Monte Carlo loss $\mathcal{L}(\theta)$ and a per-step implicit reward $\tilde{r}_\theta$. Empirically, SDPO improves results on DNA sequence design, protein inverse folding, and language modeling: it gains up to 12% in predicted activity over the most competitive RL-based baseline on DNA design and raises the GSM8K score of LLaDA-8B-Instruct from 78.6 to 81.2. Because optimization is offline and stepwise, training is also faster than with RL-based baselines. The framework is general with respect to the reward and applies to both biological sequence and language tasks.

Abstract

Discrete diffusion models have demonstrated great promise in modeling various sequence data, ranging from human language to biological sequences. Inspired by the success of RL in language models, there is growing interest in further improving these models by aligning them with a given reward. In this work, we propose an offline preference optimization method for trajectory alignment of discrete diffusion models. Instead of applying the reward to the final output and backpropagating the gradient through the entire denoising process, we decompose the problem into a set of stepwise alignment objectives by matching the per-step posterior. This framework enables efficient diffusion optimization, is compatible with arbitrary reward functions, and, importantly, yields an equivalent optimal solution under additive factorization of the trajectory reward. Experiments across multiple domains, including DNA sequence design, protein inverse folding, and language modeling, consistently demonstrate the superiority of our approach. Notably, it achieves an improvement of up to 12% over the most competitive RL-based baseline in terms of predicted activity on DNA sequence design, and further improves the GSM8K score from 78.6 to 81.2 on LLaDA-8B-Instruct for language modeling.
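To make the idea of per-step posterior matching more concrete, the following is a minimal sketch of what a distribution-matching loss at a single timestep could look like. It is not taken from the paper: the function names, the softmax cross-entropy form of the loss, and the definition of the implicit reward as $\beta$ times the log-ratio between the model and a reference posterior are all assumptions made for illustration. The inputs are the rewards of a handful of candidate clean samples ${\mathbf{x}}_0$ and their log-probabilities under the current and reference posteriors given the same ${\mathbf{x}}_t$ and ${\mathbf{c}}$.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def implicit_rewards(logp_theta, logp_ref, beta):
    """Hypothetical implicit reward: beta * log(p_theta / p_ref) per candidate."""
    return [beta * (lt - lr) for lt, lr in zip(logp_theta, logp_ref)]


def stepwise_matching_loss(rewards, logp_theta, logp_ref, beta):
    """Cross-entropy between a reward-induced and a model-induced distribution.

    Assumed form (not confirmed by the source):
      target = softmax(r / beta) over the candidate x_0 samples
      model  = softmax(implicit_reward / beta) over the same samples
    """
    target = softmax([r / beta for r in rewards])
    r_theta = implicit_rewards(logp_theta, logp_ref, beta)
    model = softmax([rt / beta for rt in r_theta])
    return -sum(p * math.log(q) for p, q in zip(target, model))


# Toy inputs: three candidate x_0 samples at one timestep (made-up numbers).
rewards = [1.0, 0.0, -1.0]
logp_ref = [-2.0, -1.5, -3.0]
beta = 0.5

# A model identical to the reference has zero implicit reward everywhere.
loss_ref = stepwise_matching_loss(rewards, logp_ref, logp_ref, beta)

# A model whose log-ratio to the reference equals r / beta matches the target.
logp_aligned = [lr + r / beta for lr, r in zip(logp_ref, rewards)]
loss_aligned = stepwise_matching_loss(rewards, logp_aligned, logp_ref, beta)

print(loss_ref, loss_aligned)
```

Under these assumptions the loss is smallest when the model's log-ratio to the reference equals $r/\beta$ up to a constant, which is the familiar form of a KL-regularized optimum, $p^\ast \propto p_{\text{ref}} \exp(r/\beta)$. The sketch uses only precomputed log-probabilities and rewards, which is what makes an offline, off-policy setup of this kind possible.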

Paper Structure

This paper contains 32 sections, 4 theorems, 26 equations, 4 figures, 10 tables, 2 algorithms.

Key Result

Theorem 4.1

The joint $p^\ast({\mathbf{x}}_{0:T}|{\mathbf{c}})$ induced by the optimal solutions $\{\hat{p}^\ast({\mathbf{x}}_0|{\mathbf{x}}_t,{\mathbf{c}})\}_{t=1}^T$ of Eq. eq:stepwise-alignment-objective is also the optimal solution of the trajectory alignment objective in Eq. eq:rlhf-diffusion, with the cha

Figures (4)

  • Figure 1: The flowchart of our SDPO.
  • Figure 2: Ablation studies of $\beta$ in (a) DNA design, and (b) protein inverse folding experiment.
  • Figure 3: Iterative labeling.
  • Figure 4: (a) The reward curve w.r.t. the number of labeled samples throughout training. (b) The correlation analysis between the induced trajectory reward $\hat{r}({\mathbf{x}}_{0:T},{\mathbf{c}})$ and the clean reward $r({\mathbf{x}}_0,{\mathbf{c}})$.

Theorems & Definitions (6)

  • Theorem 4.1
  • Proposition 4.2
  • Theorem 4.1
  • proof
  • Proposition A.2
  • proof