Why Does RLAIF Work At All?

Robin Young

TL;DR

The paper proposes the latent value hypothesis: pretraining on internet-scale data encodes human values as directions in representation space, and constitutional prompts elicit these latent values into preference judgments. The hypothesis is formalized under a linear model in which the constitution acts as a projection operator selecting value-relevant directions.

Abstract

Reinforcement Learning from AI Feedback (RLAIF) enables language models to improve by training on their own preference judgments, yet no theoretical account explains why this self-improvement appears to work for value learning. We propose the latent value hypothesis: pretraining on internet-scale data encodes human values as directions in representation space, and constitutional prompts elicit these latent values into preference judgments. We formalize this intuition under a linear model in which the constitution acts as a projection operator selecting value-relevant directions. Our analysis yields several results: RLAIF improves alignment when the constitution-activated direction correlates with true values better than the model's default generation direction, which explains the generation-judgment gap; the ceiling on RLAIF quality is determined by how well representations encode values, which scales with model capacity; and adversarial constitutions exist that can activate anti-social value directions encoded from harmful pretraining data. Our account unifies scattered empirical findings, including the refusal direction, low-rank safety subspaces, and RLAIF scaling behavior.
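The self-improvement condition stated in the abstract can be illustrated with a toy numerical sketch. Everything here is an assumption made for illustration rather than the paper's construction: the 4-dimensional space, the vectors `v_true` and `w_gen`, the helper names, and in particular the choice to model the judgment direction as the constitution's projector applied to the generation direction.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def projector(basis):
    """Orthogonal projector onto the column span of `basis`."""
    q, _ = np.linalg.qr(basis)  # orthonormalize the columns
    return q @ q.T

# Hypothetical toy representation space (4 dimensions).
v_true = np.array([1.0, 0.0, 0.0, 0.0])  # assumed "true value" direction
w_gen = np.array([1.0, 2.0, 0.0, 3.0])   # assumed default generation direction

# Assumed pro-social constitution: selects the subspace spanned by e1 and e3.
P_const = projector(np.array([[1.0, 0.0],
                              [0.0, 0.0],
                              [0.0, 1.0],
                              [0.0, 0.0]]))
w_judge = P_const @ w_gen  # constitution-activated (judgment) direction

gen_alignment = cosine(w_gen, v_true)
judge_alignment = cosine(w_judge, v_true)

# Condition from the abstract: judging must correlate with true values
# better than default generation does.
improves = judge_alignment > gen_alignment

# Assumed adversarial constitution: selects a single direction that is
# anti-correlated with v_true.
P_adv = projector(np.array([[-1.0], [0.0], [0.0], [1.0]]))
adversarial_alignment = cosine(P_adv @ w_gen, v_true)

print(f"generation alignment:  {gen_alignment:.3f}")
print(f"judgment alignment:    {judge_alignment:.3f}")
print(f"self-improvement:      {improves}")
print(f"adversarial alignment: {adversarial_alignment:.3f}")
```

In this toy setup the projection strips the value-irrelevant components from the generation direction, so the judgment direction aligns with `v_true` better than generation does. The adversarial projector instead yields a direction with negative alignment, mirroring the abstract's third result.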

Paper Structure (41 sections, 17 theorems, 56 equations)


Key Result

Proposition 2

Under Assumptions ass:linear through ass:judgment, DPO on constitutional preferences with KL penalty $\beta$ yields a closed-form optimal policy parameterized by $\lambda = 1/\beta$.
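
For orientation, the generic closed form of the KL-regularized objective that DPO implicitly optimizes can be written as follows. This is the standard result for an arbitrary reward, not necessarily the paper's linear-model expression; the symbols $\pi_{\mathrm{ref}}$ (reference policy), $r_C$ (reward implied by the constitutional preferences), and $Z$ (normalizer) are assumptions introduced here for illustration.

$$
\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\big(\lambda\, r_C(x, y)\big),
\qquad
Z(x) = \sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\,\exp\big(\lambda\, r_C(x, y')\big),
\qquad
\lambda = 1/\beta.
$$

In this form, a smaller KL penalty $\beta$ gives a larger $\lambda$, so the optimal policy moves further from the reference policy toward responses the constitutional judge prefers.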

Theorems & Definitions (46)

  • Definition 1: Encoding Quality
  • Remark 1
  • Proposition 2: RLAIF Policy
  • proof
  • Theorem 3: Self-Improvement
  • proof
  • Corollary 4: Self-Improvement Condition
  • Proposition 5: Generation-Judgment Gap
  • Theorem 6: RLAIF Ceiling
  • proof: Proof sketch
  • ...and 36 more