Understanding the Learning Dynamics of Alignment with Human Feedback

Shawn Im; Yixuan Li

TL;DR

The paper studies how aligning LLMs with human preferences via Direct Preference Optimization (DPO) unfolds during training, showing that the distribution and distinguishability of preferred vs. non-preferred data govern update rates and achievable training accuracy. Under a setup with alpha-subexponential embedding distributions, the authors derive bounds on weight updates and decision boundary progress and prove a priority effect: more distinguishable behaviors are learned faster. They corroborate the theory with experiments on Llama-2-7B and Mistral-7B across persona-based tasks, which also reveal that alignment can inadvertently facilitate misalignment when starting from aligned models. The work offers insights for data collection and training design aimed at balancing learning across heterogeneous behaviors and mitigating vulnerability to misuse. Overall, the study contributes a theoretical foundation for alignment dynamics that can inform safer deployment of aligned LLMs.
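For readers unfamiliar with the objective being analyzed, the following is a minimal sketch of the standard DPO loss on a batch of preference pairs, written in NumPy. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions and are not taken from the paper; the inputs are assumed to be precomputed sequence log-probabilities under the trained policy and a frozen reference model.

```python
import numpy as np

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Mean DPO loss over a batch of preference pairs (illustrative sketch).

    Each argument is an array-like of sequence log-probabilities.
    `_w` denotes the preferred response, `_l` the non-preferred one.
    `beta` scales the implicit reward (default value is an assumption).
    """
    policy_logp_w = np.asarray(policy_logp_w, dtype=float)
    policy_logp_l = np.asarray(policy_logp_l, dtype=float)
    ref_logp_w = np.asarray(ref_logp_w, dtype=float)
    ref_logp_l = np.asarray(ref_logp_l, dtype=float)

    # Implicit reward margin: how much more the policy favors the preferred
    # response over the non-preferred one, relative to the reference model.
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

When the policy equals the reference model the margin is zero for every pair, so the loss is log 2; it falls below that value only as the policy shifts probability toward preferred responses relative to the reference.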

Abstract

Aligning large language models (LLMs) with human intentions has become a critical task for safely deploying models in real-world systems. While existing alignment approaches have seen empirical success, theoretically understanding how these methods affect model behavior remains an open question. Our work provides an initial attempt to theoretically analyze the learning dynamics of human preference alignment. We formally show how the distribution of preference datasets influences the rate of model updates and provide rigorous guarantees on the training accuracy. Our theory also reveals an intricate phenomenon where the optimization is prone to prioritizing certain behaviors with higher preference distinguishability. We empirically validate our findings on contemporary LLMs and alignment tasks, reinforcing our theoretical insights and shedding light on considerations for future alignment approaches. Disclaimer: This paper contains potentially offensive text; reader discretion is advised.

Paper Structure (48 sections, 3 theorems, 66 equations, 24 figures)

Introduction
Preliminaries
Notations.
RLHF Overview.
Direct Preference Optimization.
A Case Study on DPO's Learning Dynamics
Task.
Dataset and Training.
Observation on Learning Dynamics.
Theoretical Insights
Setup
Characterize the Preference Distributions.
Impact of Preference Distinguishability
Interpretation and Verification.
Implication: Priority Levels for Heterogeneous Behaviors.
...and 33 more sections

Key Result

Theorem 4.1

Suppose $\max_{i \in \{+, -\}} \left\| \Sigma_i \right\| \leq c_v\sqrt{d}$ and $\max_{i \in \{+, -\}}(\left\| \mu_i \right\| + \mathop{\mathrm{Tr}}\nolimits(\Sigma_i)^{1/2}) \leq c_n \sqrt{d}$. Let $\beta = \beta' d^{-\frac{1}{2}}$ and let $\eta$ be a constant such that $\beta'^2 \eta c_n^2 \leq \ldots$ [...], where $c_v, c_n, \beta', c' > 0$ are constants, $\gamma= n/\sqrt{d}$, and $\Delta \leq 1/2$.

Figures (24)

Figure 1: Examples of positive and negative statements for the persona "openness" in the Anthropic dataset (Perez et al., 2022).
Figure 2: UMAP visualization of the last hidden state embeddings for positive (green) and negative (gray) statements of three behaviors from the Anthropic Persona dataset.
Figure 3: Training loss curves for 5 behaviors, ordered from the least distinguishable (Behavior 1) to the most distinguishable (Behavior 5), when training with the DPO objective. The weights in the unembedding layer are optimized using SGD.
Figure 4: Empirical measurement of $\left\| W_U(t) - W_U(0) \right\|$ for 5 behaviors, ordered from the least distinguishable (purple) to the most distinguishable (yellow), when training with the DPO objective. The weights in the unembedding layer are optimized using SGD.
Figure 5: Impact of prioritization when training with the DPO objective on two behaviors of differing distinguishability. (a) Training on a pair with a larger priority gap ("acts like it wants to help humans but does not care about that", "risk seeking"). (b) Training on a pair with a smaller priority gap ("desire to influence world", "subscribes to Islam").
...and 19 more figures
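The kind of measurement described in the Figure 3 and Figure 4 captions can be illustrated with a toy simulation. The sketch below is not the paper's experimental code: it assumes synthetic Gaussian embeddings whose class means are `separation` apart, a two-response setting in which the DPO margin reduces to a single direction `delta` (the change in the difference between the two response weight vectors since initialization), and full-batch gradient descent in place of SGD so the run is deterministic. The function name, dimensions, and hyperparameters are all hypothetical choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weight_change_norm(separation, d=16, n=500, beta=0.1, lr=0.1,
                       steps=50, seed=0):
    """Norm of the last-layer weight change after toy DPO-style training.

    Assumptions (not from the paper): unit-variance Gaussian embeddings with
    class means +/- separation/2 along the first axis, and full-batch
    gradient descent on the direction `delta`.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros(d)
    mean[0] = separation / 2.0
    pos = rng.normal(size=(n, d)) + mean   # embeddings of positive statements
    neg = rng.normal(size=(n, d)) - mean   # embeddings of negative statements
    g = np.vstack([pos, neg])
    y = np.concatenate([np.ones(n), -np.ones(n)])  # +1 positive, -1 negative

    delta = np.zeros(d)  # weight change relative to the reference (t = 0)
    for _ in range(steps):
        # DPO margin per example; it is zero at initialization.
        margin = beta * y * (g @ delta)
        # Gradient of mean(log(1 + exp(-margin))) with respect to delta.
        grad = -(beta * sigmoid(-margin) * y) @ g / len(y)
        delta -= lr * grad
    return float(np.linalg.norm(delta))

if __name__ == "__main__":
    for s in (0.5, 2.0, 4.0):
        print(f"separation={s}: weight change norm={weight_change_norm(s):.4f}")
```

In this toy setting the weight change should be larger when the two classes are further apart, which is the qualitative pattern the captions describe; it says nothing about the magnitudes observed on real LLM embeddings.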

Theorems & Definitions (3)

Theorem 4.1
Theorem 4.2
Theorem 4.3
