Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods

Oussama Zekri, Nicolas Boullé

TL;DR

SEPO is a policy-gradient method for fine-tuning discrete diffusion models with non-differentiable rewards, addressing a key bottleneck in RLHF for discrete generative tasks. It combines a clipped-ratio objective with self-normalized importance sampling and a gradient-flow perspective to obtain scalable, low-variance updates, and comes with a convergence analysis under standard assumptions. The approach is validated on DNA sequence design and language modeling, where SEPO achieves state-of-the-art enhancer activity and high chromatin accessibility while remaining stable at reasonable compute cost. Because it applies to both conditional and unconditional generation and treats non-differentiable rewards explicitly, SEPO is a versatile tool for fine-tuning discrete generative models.
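To make the two ingredients named in the summary concrete, here is a minimal NumPy sketch of a generic PPO-style clipped-ratio surrogate paired with self-normalized importance weights. It is an illustration under stated assumptions, not the SEPO objective itself: SEPO operates on the score ratios of a discrete diffusion model, whereas this sketch assumes per-sample log-probabilities are directly available. The function names, the use of the weights to form a reward baseline, and the default `eps=0.2` are all hypothetical choices made for the example.

```python
import numpy as np


def snis_weights(log_ratio):
    """Self-normalized importance weights; non-negative and summing to 1."""
    # Subtract the max before exponentiating for numerical stability.
    w = np.exp(log_ratio - np.max(log_ratio))
    return w / w.sum()


def clipped_surrogate(log_ratio, advantages, eps=0.2):
    """Per-sample clipped-ratio surrogate (PPO-style)."""
    ratio = np.exp(log_ratio)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Taking the minimum gives a pessimistic bound on the improvement.
    return np.minimum(unclipped, clipped)


def snis_clipped_objective(logp_new, logp_old, rewards, eps=0.2):
    """Assumed combination: SNIS reward baseline plus clipped surrogate."""
    log_ratio = logp_new - logp_old
    # Estimate the expected reward under the new policy from old-policy samples.
    baseline = np.sum(snis_weights(log_ratio) * rewards)
    advantages = rewards - baseline
    return np.mean(clipped_surrogate(log_ratio, advantages, eps))


# Usage with made-up numbers: four samples scored by a black-box reward.
logp_old = np.array([-2.0, -1.5, -3.0, -2.5])
logp_new = np.array([-1.8, -1.6, -2.7, -2.9])
rewards = np.array([1.0, 0.2, 0.8, 0.1])
objective = snis_clipped_objective(logp_new, logp_old, rewards)
```

The reward enters only as a number per sample, so nothing here requires it to be differentiable; in practice the log-probabilities would come from an automatic-differentiation framework so that the objective can be maximized by gradient ascent.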

Abstract

Discrete diffusion models have recently gained significant attention due to their ability to process complex discrete structures for language modeling. However, fine-tuning these models with policy gradient methods, as is commonly done in Reinforcement Learning from Human Feedback (RLHF), remains a challenging task. We propose an efficient, broadly applicable, and theoretically justified policy gradient algorithm, called Score Entropy Policy Optimization (SEPO), for fine-tuning discrete diffusion models over non-differentiable rewards. Our numerical experiments across several discrete generative tasks demonstrate the scalability and efficiency of our method. Our code is available at https://github.com/ozekri/SEPO.

Paper Structure

This paper contains 89 sections, 103 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: On the estimation of $q^\theta_t(y)$. Left: Given a sample $x$, only its neighbors $\{y_i\}_{1\leq i\leq 5} \in \mathcal{X}$ are accessible for computing $q_t^\theta(y_i \mid x)$, and each $y_i$ typically has only one such parent $x$ in the sampled batch. Right: A given $y_5$ has several neighbors $\{z_j\}_{1\leq j\leq 5}$. It is unlikely that two distinct samples in the batch are both neighbors of $y_5$, since this would require them to differ from $y_5$ at exactly the same token position.
  • Figure 2: Violin plot of Pred-Activity scores across models. The plot shows the distribution of predicted enhancer activity (Pred-Activity) from the held-out reward oracle for each model, across 640 generated sequences. SEPO and SEPO with GF achieve the highest Pred-Activity scores with low variance, illustrating both the effectiveness and the stability of our optimization process. Results for GLID$^2$E are not displayed because the corresponding fine-tuned model weights are not publicly available.
  • Figure 3: Illustration of the iterative fine-tuning process for discrete diffusion models using policy gradient methods. The initial model $\overline{Q}_{\theta_{\mathrm{pre}}}$ (conditionally) generates responses, which are evaluated by a reward function. Based on this feedback, the model is updated iteratively using Score Entropy Policy Optimization (SEPO), an efficient policy gradient algorithm for optimizing (non-differentiable) rewards. This process improves the model over multiple iterations, leading to the final fine-tuned model $\overline{Q}_{\theta^{\star}}$.
  • Figure 4: GPT-2 reward modeling pipeline.
  • Figure 5: SEPO fine-tuning pipeline for SEDD Medium.
  • ...and 4 more figures

Theorems & Definitions (7)

  • Remark 3.3
  • proof
  • proof
  • Remark C.1
  • proof
  • proof
  • Remark C.2