Aligning Diffusion Language Models via Unpaired Preference Optimization

Vaibhav Jindal, Hejian Sang, Chun-Mao Lai, Yanning Chen, Zhipeng Wang

TL;DR

This work addresses aligning diffusion language models (dLLMs) to human preferences when paired comparisons are scarce. It introduces ELBO-KTO, which replaces the intractable diffusion log-likelihood with a Monte Carlo ELBO margin and plugs that margin into a Kahneman-Tversky Optimization (KTO) objective trained on unpaired feedback, with variance-reduction techniques to stabilize gradients. The authors analyze the bias and variance introduced by the ELBO substitution, introduce a global minibatch baseline, and report adjusted win rates of 65.9% on kto-mix-14k and 62.3% on UltraFeedback-Binary against the base LLaDA-8B-Instruct model under an automatic LLM judge, with downstream performance on GSM8K and MMLU on par with or better than the base model. The results position unpaired preference optimization as a practical alternative to paired alignment for diffusion LLMs, one that makes efficient use of binary feedback signals.

Abstract

Diffusion language models (dLLMs) are an emerging alternative to autoregressive (AR) generators, but aligning them to human preferences is challenging because sequence log-likelihoods are intractable and pairwise preference data are costly to collect. We introduce ELBO-KTO, which combines an ELBO surrogate for diffusion log-likelihoods with a prospect-theoretic, unpaired preference objective (Kahneman Tversky Optimization, KTO). We analyze the bias and variance induced by the ELBO substitution and employ variance-reduction practices that stabilize gradients during training. Applied to LLaDA-8B-Instruct, ELBO-KTO yields 65.9% and 62.3% adjusted win rates on kto-mix-14k and UltraFeedback-Binary, respectively, versus the base model under an automatic LLM judge. Across downstream tasks, including GSM8K, MMLU, and additional reasoning/knowledge benchmarks, ELBO-KTO trained on UltraFeedback-Binary performs on par with or better than the base model under identical decoding. This establishes unpaired preference optimization as a viable alternative to pairwise alignment in diffusion LLMs.
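To make the objective concrete, the following is a minimal NumPy sketch of a KTO-style loss driven by ELBO margins. It is an illustration under stated assumptions, not the authors' implementation: the function name `elbo_kto_loss`, its parameters, and the choice of the batch-mean margin as the "global minibatch baseline" are all hypothetical, and the per-example ELBO estimates are taken as given inputs rather than computed from a diffusion model. The sigmoid value function with separate weights for desirable and undesirable examples follows the standard KTO formulation.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))


def elbo_kto_loss(elbo_policy, elbo_ref, desirable,
                  beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """KTO-style loss on Monte Carlo ELBO margins (illustrative sketch).

    elbo_policy, elbo_ref: arrays of shape (batch, n_mc) holding Monte Carlo
        ELBO estimates for the trained policy and the frozen reference model.
    desirable: boolean array of shape (batch,), True for desirable examples.
    beta, lambda_d, lambda_u: hypothetical names for the sigmoid scale and
        the desirable/undesirable weights.
    """
    elbo_policy = np.asarray(elbo_policy, dtype=float)
    elbo_ref = np.asarray(elbo_ref, dtype=float)
    desirable = np.asarray(desirable, dtype=bool)

    # ELBO margin per example: the surrogate for the log-likelihood ratio,
    # averaged over the Monte Carlo draws.
    margin = (elbo_policy - elbo_ref).mean(axis=1)

    # ASSUMPTION: the baseline is the mean margin over the whole minibatch,
    # used as a fixed reference point (no gradient would flow through it).
    baseline = margin.mean()
    centered = beta * (margin - baseline)

    # Desirable examples are rewarded for margins above the baseline,
    # undesirable examples for margins below it.
    value = np.where(desirable,
                     lambda_d * sigmoid(centered),
                     lambda_u * sigmoid(-centered))
    weight = np.where(desirable, lambda_d, lambda_u)
    return float(np.mean(weight - value))
```

With identical policy and reference estimates every margin is zero and the loss sits at 0.5 (for unit weights); it drops as desirable examples move above the baseline and undesirable ones move below it.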

Paper Structure

This paper contains 85 sections, 7 theorems, 96 equations, 1 figure, 6 tables.

Key Result

Theorem 1

The minibatch loss bias relative to the global-baseline target is bounded in terms of $\Psi(S)$, the centered-margin variance aggregator (equation eq:Psi in the paper).

Figures (1)

  • Figure 1: Adjusted win rate vs. LLaDA-8B-Instruct on kto-mix-14k when varying the ratio of desirable to undesirable examples. Left: subsampling desirable examples; Right: subsampling undesirable examples. ELBO-KTO benefits more from desirable examples, consistent with gain sensitivity.

Theorems & Definitions (14)

  • Theorem 1: Loss bias bound
  • Theorem 2: Loss variance bound
  • Theorem 3: Gradient bias bound
  • Theorem 4: Gradient variance
  • Lemma 1: Global Baseline Optimality
  • Lemma 2: Lipschitz constants of the scaled sigmoid
  • proof
  • Lemma 3: Centered-Margin Variance Aggregator
  • proof
  • proof
  • ...and 4 more
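
As a worked illustration of the topic of Lemma 2, assume that "scaled sigmoid" means $\sigma_\beta(x) = \sigma(\beta x)$ with $\beta > 0$; the listing above gives neither the paper's definition nor its constants, so this is only a sketch under that assumption. The standard Lipschitz constants then follow from the derivatives of the logistic function:

```latex
\begin{align*}
\sigma_\beta(x) &= \sigma(\beta x), \qquad \sigma(u) = \frac{1}{1 + e^{-u}}, \\
\sigma_\beta'(x) &= \beta\,\sigma(\beta x)\bigl(1 - \sigma(\beta x)\bigr),
  \qquad \sup_x \lvert \sigma_\beta'(x) \rvert = \frac{\beta}{4}
  \quad (\text{attained at } x = 0), \\
\sigma_\beta''(x) &= \beta^2\,\sigma(\beta x)\bigl(1 - \sigma(\beta x)\bigr)\bigl(1 - 2\sigma(\beta x)\bigr),
  \qquad \sup_x \lvert \sigma_\beta''(x) \rvert = \frac{\beta^2}{6\sqrt{3}}.
\end{align*}
```

Under this assumption $\sigma_\beta$ is $(\beta/4)$-Lipschitz and its derivative is $(\beta^2/(6\sqrt{3}))$-Lipschitz; the paper's lemma may state its constants in a different form.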