Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment

Karel D'Oosterlinck; Winnie Xu; Chris Develder; Thomas Demeester; Amanpreet Singh; Christopher Potts; Douwe Kiela; Shikib Mehri

TL;DR

This paper introduces Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective.

Abstract

Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets. The interaction between model, paired data, and objective makes alignment a complicated procedure, sometimes producing subpar results. We study this and find that (i) preference data gives a better learning signal when the underlying responses are contrastive, and (ii) alignment objectives lead to better performance when they specify more control over the model during training. Based on these insights, we introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. We align Llama-3-8B-Instruct using various comparable datasets and alignment objectives and measure MixEval-Hard scores, which correlate highly with human judgments. The CLAIR preferences lead to the strongest performance out of all datasets, and APO consistently outperforms less controllable objectives. Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%. Our code is available at https://github.com/ContextualAI/CLAIR_and_APO.

Paper Structure (29 sections, 9 equations, 4 figures, 5 tables)

Introduction
Underspecification in Alignment
Contrastive Learning from Revisions
Anchored Preference Optimization
Alignment Experiments
Evaluation Methodology
Training Specifications
Results
Preference Data
Alignment Objectives
Analysis
Preference Data
Alignment Objectives
Related Work
Changing the LM more / less:
...and 14 more sections

Figures (4)

Figure 1: Alignment is underspecified with regard to preferences and training objective. A: Preference pairs can vary along irrelevant aspects; Contrastive Learning from AI Revisions (CLAIR) creates a targeted preference signal instead. B: The quality of the model can impact alignment training; Anchored Preference Optimization (APO) explicitly accounts for this.
Figure 2: An answer produced by Llama-3-8B-Instruct for a prompt, and corresponding GPT4-turbo revision of this answer. The differences between answer and revision are highlighted. The revision generally follows the same outline as the answer but improves it where possible. For example, the revision correctly alters the count of Parisian restaurants from 2 to 3 in the second line of the answer.
Figure 3: Comparison of gradients between DPO (equation A), APO-zero (equation B), and APO-down (equation C). Each gradient term is decomposed into a direction factor and a magnitude factor. Direction: Each APO variant explicitly specifies whether the winning and losing likelihoods should increase or decrease during training. DPO only increases the likelihood difference, which leaves the actual movement of these likelihoods during training ambiguous. This explicit specification of direction is core to the APO variants and allows for a tighter fit between model and data during alignment. Magnitude: Each term in APO is scaled with a delta function. Here, $\delta(x) = \sigma(x)(1-\sigma(x))$ is a function with a global maximum at $x = 0$ that tends to $0$ for $x \to \pm \infty$. This causes APO gradients to saturate whenever the quantities being optimized have changed a lot compared to the beginning of training. Ethayarajh et al. (2024) theorize that such scaling leads to more robust optimization.
Figure 4: Log-likelihood and reward on held-out winning and losing outputs for Llama-3-8B-Instruct trained on CLAIR, on-policy judge, off-policy judge, and Stronger Preferred preference datasets, using APO-down, APO-zero, or DPO alignment objectives.
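
The magnitude factor described in the Figure 3 caption can be checked numerically. The following is a minimal sketch, not code from the paper: the function names `sigmoid` and `delta` are chosen here for illustration, and only the formula $\delta(x) = \sigma(x)(1-\sigma(x))$ is taken from the caption.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def delta(x: float) -> float:
    """Magnitude factor from the Figure 3 caption: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


if __name__ == "__main__":
    # The factor peaks at x = 0 and shrinks toward 0 as |x| grows,
    # which is the saturation behaviour the caption describes.
    for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
        print(f"x={x:+5.1f}  delta={delta(x):.6f}")
```

Under this reading, a gradient term scaled by `delta` contributes most when the optimized quantity is near its starting value and contributes little once it has moved far in either direction.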
