Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods
Oussama Zekri, Nicolas Boullé
TL;DR
SEPO introduces a principled, policy-gradient-based method for fine-tuning discrete diffusion models with non-differentiable rewards, addressing a key bottleneck in RLHF for discrete generative tasks. It combines a clipped-ratio objective with self-normalized importance sampling and a gradient-flow perspective to achieve scalable, low-variance updates, with convergence analysis under standard assumptions. The approach is validated on DNA sequence design and language modeling, where SEPO achieves state-of-the-art enhancer activity and high chromatin accessibility, while maintaining stability and reasonable compute. SEPO's generality for conditional and unconditional generation and its explicit treatment of non-differentiable rewards position it as a versatile tool for discrete-generation fine-tuning.
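To make the two ingredients named above concrete, here is a minimal, generic sketch of a self-normalized importance sampling estimate and a PPO-style clipped-ratio surrogate. It is not the SEPO objective itself: the function names (`snis_estimate`, `clipped_surrogate`), the mean-reward baseline, the clipping parameter `eps=0.2`, and the toy log-probabilities are all illustrative assumptions, and how SEPO actually combines these pieces for discrete diffusion is defined in the paper and its repository.

```python
import math


def snis_estimate(log_ratios, values):
    """Self-normalized importance sampling estimate of an expectation.

    log_ratios[i] is log(p_new(x_i) / p_old(x_i)) for a sample x_i drawn
    from the old policy; values[i] is the quantity to average (e.g. a reward).
    Hypothetical helper for illustration only.
    """
    m = max(log_ratios)  # subtract the max for numerical stability
    weights = [math.exp(lr - m) for lr in log_ratios]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Generic PPO-style clipped-ratio surrogate (to be maximized).

    Assumption: per-sample log-probabilities are available for both
    policies; eps is an illustrative clipping range.
    """
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


# Toy data (made up): three samples with black-box, non-differentiable rewards.
logp_old = [-1.0, -2.0, -0.5]
logp_new = [-0.9, -2.5, -0.5]
rewards = [1.0, 0.0, 2.0]

baseline = sum(rewards) / len(rewards)  # simple mean-reward baseline
advantages = [r - baseline for r in rewards]
log_ratios = [n - o for n, o in zip(logp_new, logp_old)]

print(snis_estimate(log_ratios, rewards))
print(clipped_surrogate(logp_new, logp_old, advantages))
```

Only reward values enter the computation, never reward gradients, which is why this family of estimators suits non-differentiable rewards. The clipping keeps each update close to the sampling policy, and the self-normalization keeps the weights bounded at the cost of a small bias.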
Abstract
Discrete diffusion models have recently gained significant attention due to their ability to process complex discrete structures for language modeling. However, fine-tuning these models with policy gradient methods, as is commonly done in Reinforcement Learning from Human Feedback (RLHF), remains a challenging task. We propose an efficient, broadly applicable, and theoretically justified policy gradient algorithm, called Score Entropy Policy Optimization (SEPO), for fine-tuning discrete diffusion models over non-differentiable rewards. Our numerical experiments across several discrete generative tasks demonstrate the scalability and efficiency of our method. Our code is available at https://github.com/ozekri/SEPO.
