Aligning Diffusion Language Models via Unpaired Preference Optimization

Vaibhav Jindal; Hejian Sang; Chun-Mao Lai; Yanning Chen; Zhipeng Wang

TL;DR

This work tackles the challenge of aligning diffusion language models (dLLMs) to human preferences when explicit paired comparisons are scarce. It introduces ELBO-KTO, which substitutes the intractable diffusion log-likelihood with a Monte Carlo ELBO margin and embeds it within a Kahneman–Tversky optimization framework using unpaired feedback, augmented by variance-reduction techniques. The authors provide a theoretical analysis of bias and variance, introduce a global minibatch baseline, and demonstrate strong empirical gains: 65.9% adjusted win rate on kto-mix-14k and 62.3% on UltraFeedback-Binary, with competitive downstream performance on GSM8K and MMLU. This work establishes unpaired preference optimization as a practical alternative to paired alignment for diffusion LLMs, enabling efficient use of binary feedback signals in model alignment.

Abstract

Diffusion language models (dLLMs) are an emerging alternative to autoregressive (AR) generators, but aligning them to human preferences is challenging because sequence log-likelihoods are intractable and pairwise preference data are costly to collect. We introduce ELBO-KTO, which combines an ELBO surrogate for diffusion log-likelihoods with a prospect-theoretic, unpaired preference objective (Kahneman-Tversky Optimization, KTO). We analyze the bias and variance induced by the ELBO substitution and employ variance-reduction practices that stabilize gradients during training. Applied to LLaDA-8B-Instruct, ELBO-KTO yields adjusted win rates of 65.9% on kto-mix-14k and 62.3% on UltraFeedback-Binary against the base model, as scored by an automatic LLM judge. Across downstream tasks, including GSM8K, MMLU, and additional reasoning/knowledge benchmarks, ELBO-KTO trained on UltraFeedback-Binary performs on par with or better than the base model under identical decoding. This establishes unpaired preference optimization as a viable alternative to pairwise alignment in diffusion LLMs.
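To make the shape of the objective concrete, the following is a minimal sketch of a KTO-style loss in which the log-likelihood ratio is replaced by an ELBO margin (policy ELBO minus reference ELBO). It is an illustration under stated assumptions, not the authors' implementation: the function name `kto_style_loss`, the parameters `beta`, `lambda_d`, and `lambda_u`, and the use of the minibatch mean margin as the reference point are all hypothetical choices, and the Monte Carlo ELBO estimates are assumed to have been computed elsewhere and passed in as plain floats.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def kto_style_loss(elbo_policy, elbo_ref, desirable,
                   beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """KTO-style loss on ELBO margins for one minibatch (illustrative only).

    elbo_policy, elbo_ref: per-example ELBO estimates (floats) under the
        policy and the frozen reference model.
    desirable: per-example booleans, True for "good" responses.
    beta, lambda_d, lambda_u: hypothetical hyperparameters.
    """
    # ELBO margin stands in for log pi_theta(y|x) - log pi_ref(y|x).
    margins = [p - r for p, r in zip(elbo_policy, elbo_ref)]

    # Assumption: use the minibatch mean margin as the reference point.
    # In a real training loop this would be treated as a constant
    # (no gradient flows through it).
    baseline = sum(margins) / len(margins)

    losses = []
    for margin, is_desirable in zip(margins, desirable):
        if is_desirable:
            # Reward margins above the reference point.
            value = lambda_d * sigmoid(beta * (margin - baseline))
            losses.append(lambda_d - value)
        else:
            # Reward margins below the reference point.
            value = lambda_u * sigmoid(beta * (baseline - margin))
            losses.append(lambda_u - value)
    return sum(losses) / len(losses)
```

Because each example carries only a binary label, no paired comparison is needed: a desirable response lowers the loss when its margin rises above the reference point, and an undesirable one lowers it when its margin falls below.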

