Margin-aware Preference Optimization for Aligning Diffusion Models without Reference

Jiwoo Hong; Sayak Paul; Noah Lee; Kashif Rasul; James Thorne; Jongheon Jeong

TL;DR

MaPO removes the need for a reference model in diffusion-model preference alignment by introducing a margin-based, reference-free Bradley-Terry objective for text-to-image (T2I) fine-tuning. The approach directly regularizes the likelihood margin between chosen and rejected outputs, enabling stable, efficient alignment across safe generation, style adaptation, cultural representation, personalization, and general preference tasks. Empirical results show that MaPO often outperforms Diffusion-DPO and specialized methods, with gains that become more pronounced as reference mismatch grows, and that it reduces training time by about 15%. The work provides a unified, memory-efficient framework for broad T2I adaptation that mitigates reference-mismatch issues in real-world data.

Abstract

Modern preference alignment methods, such as DPO, rely on divergence regularization to a reference model for training stability, but this creates a fundamental problem we call "reference mismatch." In this paper, we investigate the negative impacts of reference mismatch in aligning text-to-image (T2I) diffusion models, showing that larger reference mismatch hinders effective adaptation given the same amount of data, e.g., when learning new artistic styles or personalizing to specific objects. We demonstrate this phenomenon across T2I diffusion models and introduce margin-aware preference optimization (MaPO), a reference-agnostic approach that breaks free from this constraint. By directly optimizing the likelihood margin between preferred and dispreferred outputs under the Bradley-Terry model without anchoring to a reference, MaPO transforms diverse T2I tasks into unified pairwise preference optimization. We validate MaPO's versatility across five challenging domains: (1) safe generation, (2) style adaptation, (3) cultural representation, (4) personalization, and (5) general preference alignment. Our results reveal that MaPO's advantage grows dramatically with reference mismatch severity, outperforming both DPO and specialized methods like DreamBooth while reducing training time by 15%. MaPO thus emerges as a versatile and memory-efficient method for generic T2I adaptation tasks.
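To make the contrast between reference-anchored and reference-free objectives concrete, here is a minimal sketch of a Bradley-Terry loss applied to a likelihood margin, with and without a reference model. It is an illustration under stated assumptions, not the paper's exact objective: the function names, the scalar `beta`, and the use of plain per-sample log-likelihood scores as inputs are all hypothetical. For a diffusion model these scores would themselves have to be approximated, for example from denoising errors.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def margin_bt_loss(logp_chosen: float, logp_rejected: float,
                   beta: float = 1.0) -> float:
    """Reference-free Bradley-Terry loss (illustrative, hypothetical name).

    The loss depends only on the trained model's own likelihood margin
    between the preferred and dispreferred outputs.
    """
    margin = logp_chosen - logp_rejected
    return -log_sigmoid(beta * margin)


def reference_anchored_loss(logp_chosen: float, logp_rejected: float,
                            ref_chosen: float, ref_rejected: float,
                            beta: float = 1.0) -> float:
    """DPO-style loss for comparison (illustrative, hypothetical name).

    Each log-likelihood is measured relative to a frozen reference model,
    so the result depends on how well that reference fits the data.
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


if __name__ == "__main__":
    # Same policy scores, two different (made-up) reference models.
    print(margin_bt_loss(-1.0, -3.0))
    print(reference_anchored_loss(-1.0, -3.0, -1.5, -2.0))
    print(reference_anchored_loss(-1.0, -3.0, -0.5, -6.0))
```

In this toy setting the reference-free loss is fixed by the model's own margin, while the reference-anchored loss changes with the reference scores, which is the dependence the abstract describes as reference mismatch.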

Paper Structure (41 sections, 14 equations, 41 figures, 4 tables)

Introduction
Preliminaries
Text-to-image diffusion models
Preference optimization for alignment
Margin-aware Preference Optimization
Motivation: Reference mismatch problem
Reference mismatch in T2I tasks
Approach: Reference-free diffusion alignment
Objective function of MaPO
Joint matching and alignment
Preference learning as a margin regularization
Theoretical justification for reference-free link functions
Unifying T2I fine-tuning as preference alignment
Experiments
Experimental details
...and 26 more sections

Figures (41)

Figure 1: Reference mismatch, heterogeneity between the training model's distribution $p_\theta$ and the data distribution $p_\text{data}$, leads to suboptimal preference learning. MaPO is a reference-free preference optimization method, more robust to such heterogeneity while adapting to the preference.
Figure 2: Reference mismatch between the model generation $x_0^\theta$ and data $x_0^\mathcal{D}$ quantified by the cosine distance in the DINOv2 embeddings. The personalization task has the lowest similarity, implying the highest mismatch.
Figure 3: Comparison between SFT, DPO, and MaPO on aligning SDXL for cultural representation, safe generation, and style learning tasks with increasing size of train set (train set size of 0 refers to the base SDXL). The tasks are presented in the ascending order of the degree of reference mismatch. We use Qwen2-VL-7B-Instruct (top) and GPT-4o (bottom) as judges.
Figure 4: MaPO in safe generation - The base SDXL, Diffusion-DPO, and MaPO trained with Pick-Safety.
Figure 5: MaPO in style learning - SDXL trained on 5,000 instances in Pick-Cartoon with MaPO and DPO.
...and 36 more figures
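
The Figure 2 caption measures reference mismatch as a cosine distance between embeddings of model generations and data. A minimal sketch of that measurement follows, assuming the embeddings have already been extracted as plain lists of floats (the captions name DINOv2 as the encoder; this sketch does not call it). The function names and the averaging over pairs are assumptions for illustration only.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def mean_mismatch(model_embs: Sequence[Sequence[float]],
                  data_embs: Sequence[Sequence[float]]) -> float:
    """Average cosine distance over paired embeddings (hypothetical helper).

    model_embs[i] is assumed to embed a model generation and data_embs[i]
    the dataset image for the same prompt.
    """
    dists = [cosine_distance(m, d) for m, d in zip(model_embs, data_embs)]
    return sum(dists) / len(dists)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real image embeddings.
    print(mean_mismatch([[1.0, 0.0], [1.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

A larger average distance corresponds to lower similarity between generations and data, which the caption reads as higher mismatch.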
