Preference-Based Alignment of Discrete Diffusion Models

Umberto Borso, Davide Paglieri, Jude Wells, Tim Rocktäschel

TL;DR

This work tackles aligning discrete diffusion models with task-specific preferences when explicit reward signals are unavailable. It defines Discrete Diffusion DPO (D2-DPO), which adapts Direct Preference Optimization to diffusion models formulated as continuous-time Markov chains, and derives a loss that uses pairwise preferences to fine-tune the generative process while preserving fidelity to a reference distribution. A key result is a closed-form, efficiently estimable objective under masking-state noise. The final loss is $L_{\text{D2-DPO}}(\theta) = -\mathbb{E}_{(x_1^w, x_1^l) \sim \mathcal{P},\, t \sim \mathcal{U}[0,1],\, x_t^w \sim q(x_t|x_1^w),\, x_t^l \sim q(x_t|x_1^l)} \log \sigma\left[ \beta\,\mathcal{D}^{\theta}_{\text{ref}}(x_t^w|x_1^w) - \beta\,\mathcal{D}^{\theta}_{\text{ref}}(x_t^l|x_1^l) \right]$, where $x_1^w$ and $x_1^l$ are the preferred and dispreferred samples of a preference pair; this enables direct, reward-free fine-tuning. Empirically, D2-DPO biases discrete diffusion outputs toward preferred structured forms (e.g., binary sequences encoding odd integers) while preserving validity, demonstrating a practical alternative to RL-based methods. The framework holds promise for scaling to more complex tasks such as language modeling and protein sequence design, and leaves room for exploring other noise schedules.
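To make the shape of the loss above concrete, the following is a minimal sketch of its Monte Carlo estimate, not the authors' implementation. It treats the quantities $\mathcal{D}^{\theta}_{\text{ref}}(x_t^w|x_1^w)$ and $\mathcal{D}^{\theta}_{\text{ref}}(x_t^l|x_1^l)$ as precomputed numbers, since their definition is not given in this summary. The names `MASK`, `mask_noise`, and `d2_dpo_loss` are hypothetical, and the linear schedule (each token kept with probability $t$) is an assumption chosen only so that $t=1$ corresponds to clean data $x_1$.

```python
import math
import random

MASK = -1  # hypothetical id for the masking state


def mask_noise(x1, t, rng):
    """Sample x_t ~ q(x_t | x_1) under an ASSUMED linear masking schedule:
    each token is kept with probability t and replaced by MASK otherwise."""
    return [tok if rng.random() < t else MASK for tok in x1]


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def d2_dpo_loss(d_w, d_l, beta):
    """Average of -log sigmoid(beta * (D_w - D_l)) over sampled pairs.

    d_w, d_l: per-sample values of D_ref^theta for the preferred and
    dispreferred sequences (treated here as opaque inputs)."""
    terms = [log_sigmoid(beta * (dw - dl)) for dw, dl in zip(d_w, d_l)]
    return -sum(terms) / len(terms)


rng = random.Random(0)
x_t = mask_noise([1, 0, 1, 1], t=0.5, rng=rng)  # partially masked copy
loss = d2_dpo_loss(d_w=[0.3, 0.1], d_l=[-0.2, 0.0], beta=1.0)
```

When the two terms are equal the loss is $\log 2$, and it falls as the preferred term exceeds the dispreferred one, with $\beta$ scaling how sharply that difference is rewarded.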

Abstract

Diffusion models have achieved state-of-the-art performance across multiple domains, with recent advancements extending their applicability to discrete data. However, aligning discrete diffusion models with task-specific preferences remains challenging, particularly in scenarios where explicit reward functions are unavailable. In this work, we introduce Discrete Diffusion DPO (D2-DPO), the first adaptation of Direct Preference Optimization (DPO) to discrete diffusion models formulated as continuous-time Markov chains. Our approach derives a novel loss function that directly fine-tunes the generative process using preference data while preserving fidelity to a reference distribution. We validate D2-DPO on a structured binary sequence generation task, demonstrating that the method effectively aligns model outputs with preferences while maintaining structural validity. Our results highlight that D2-DPO enables controlled fine-tuning without requiring explicit reward models, making it a practical alternative to reinforcement learning-based approaches. Future research will explore extending D2-DPO to more complex generative tasks, including language modeling and protein sequence generation, as well as investigating alternative noise schedules, such as uniform noising, to enhance flexibility across different applications.

Paper Structure

This paper contains 18 sections, 50 equations, 1 figure.

Figures (1)

  • Figure 1: Results for preference-based alignment using the D2-DPO loss. (Left) Training loss monotonically decreases over epochs. (Center) Ratio of generated sequences corresponding to odd integers increases w.r.t. reference model. (Right) Fraction of generated sequences with valid structure remains close to 1.
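
The odd-integer metric in the center panel can be illustrated with a short sketch. It assumes each generated sample is a list of bits whose last element is the least significant bit; the actual encoding and the validity check used in the paper are not specified in this summary, so `odd_fraction` is a hypothetical helper and validity is not modeled.

```python
def odd_fraction(sequences):
    """Fraction of bit sequences that encode an odd integer.

    ASSUMPTION: the last bit of each sequence is the least significant bit,
    so a sequence is odd exactly when that bit equals 1."""
    if not sequences:
        raise ValueError("need at least one sequence")
    return sum(1 for seq in sequences if seq[-1] == 1) / len(sequences)


# Hypothetical samples from a reference model and a fine-tuned model.
reference_samples = [[0, 1, 0], [1, 1, 1], [1, 0, 0], [0, 0, 1]]
finetuned_samples = [[0, 1, 1], [1, 1, 1], [1, 0, 1], [0, 0, 0]]

ref_odd = odd_fraction(reference_samples)
tuned_odd = odd_fraction(finetuned_samples)
```

Comparing `tuned_odd` against `ref_odd` gives the kind of relative shift the center panel reports.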