Why Does RLAIF Work At All?

Robin Young

TL;DR

The paper proposes the latent value hypothesis: pretraining on internet-scale data encodes human values as directions in representation space, and constitutional prompts elicit these latent values as preference judgments. The hypothesis is formalized under a linear model in which the constitution acts as a projection operator selecting value-relevant directions.

Abstract

Reinforcement Learning from AI Feedback (RLAIF) enables language models to improve by training on their own preference judgments, yet no theoretical account explains why this self-improvement appears to work for value learning. We propose the latent value hypothesis: pretraining on internet-scale data encodes human values as directions in representation space, and constitutional prompts elicit these latent values as preference judgments. We formalize this intuition under a linear model in which the constitution acts as a projection operator selecting value-relevant directions. Our analysis yields several results: RLAIF improves alignment when the constitution-activated direction correlates with true values better than the model's default generation direction, which explains the generation-judgment gap; the ceiling on RLAIF quality is determined by how well representations encode values, which scales with model capacity; and adversarial constitutions exist that can activate anti-social value directions encoded from harmful pretraining data. Our account unifies scattered empirical findings, including the refusal direction, low-rank safety subspaces, and RLAIF scaling behavior.
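The linear picture in the abstract can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the 3-dimensional space, the vectors `v_true`, `w_gen`, and `m`, the choice of cosine similarity as the alignment measure, and the additive update `w_gen + lam * w_judge`. It only shows the qualitative claim that when the constitution-projected direction is closer to the true value direction than the default generation direction is, moving toward it improves alignment.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, used here as a stand-in alignment measure."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# All vectors below are made-up illustrative numbers in a 3-d space.
v_true = np.array([1.0, 0.0, 0.0])   # hypothetical "true values" direction
w_gen = np.array([0.2, 1.0, 0.0])    # hypothetical default generation direction
m = np.array([0.9, 0.5, 0.4])        # hypothetical latent encoding from pretraining

# Constitution modeled as an orthogonal projector onto an assumed
# value-relevant subspace (spanned by the first and third axes).
U = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0]])
P = U @ U.T
w_judge = P @ m                      # constitution-activated judgment direction

# Assumed update rule: shift the generation direction toward the
# judgment direction with step size lam.
lam = 1.0
w_new = w_gen + lam * w_judge

align_gen = cosine(w_gen, v_true)
align_judge = cosine(w_judge, v_true)
align_new = cosine(w_new, v_true)
print(f"generation: {align_gen:.3f}  judgment: {align_judge:.3f}  "
      f"after update: {align_new:.3f}")
```

In this toy setting the updated direction lands between the generation and judgment directions, so its alignment is bounded above by that of the judgment direction, which loosely mirrors the ceiling result described in the abstract.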

Paper Structure (41 sections, 17 theorems, 56 equations)

Introduction
Related Work
Problem Setup
Representations and Values
Generation and Judgment
Alignment Measure
Self-Improvement Condition
RLAIF as Direction Adjustment
Improvement Condition
The Generation-Judgment Gap
A Toy Example
RLAIF Ceiling
Conjecture on Low-Rank Values
Adversarial Constitutions
Accounting for Existing Evidence
...and 26 more sections

Key Result

Proposition 2

Under Assumptions ass:linear--ass:judgment, DPO on constitutional preferences with KL penalty $\beta$ yields the optimal policy: where $\lambda = 1/\beta$.
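For orientation, the generic closed form that KL-regularized preference optimization (and therefore DPO) targets is given below. This is the standard result, not the paper's linear-model specialization; the reward $r$, the reference policy $\pi_{\mathrm{ref}}$, and the normalizer $Z(x)$ are notation assumed here rather than taken from the source.

$$
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\big(\lambda\, r(x, y)\big), \qquad \lambda = \frac{1}{\beta}, \qquad Z(x) = \sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\, \exp\big(\lambda\, r(x, y')\big).
$$

A smaller KL penalty $\beta$ corresponds to a larger $\lambda$, so the optimum moves further from the reference policy in the direction the reward favors.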

Theorems & Definitions (46)

Definition 1: Encoding Quality
Remark 1
Proposition 2: RLAIF Policy
Proof
Theorem 3: Self-Improvement
Proof
Corollary 4: Self-Improvement Condition
Proposition 5: Generation-Judgment Gap
Theorem 6: RLAIF Ceiling
Proof sketch
...and 36 more
