Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Thomas Kwa; Drake Thomas; Adrià Garriga-Alonso

TL;DR

A discrete optimization method is adapted to measure the tails of reward models, finding that they are consistent with light-tailed error, but the pervasiveness of heavy-tailed distributions in many real-world applications indicates that future sources of RL reward could have heavy-tailed error, increasing the likelihood of reward hacking even with KL regularization.

Abstract

When applying reinforcement learning from human feedback (RLHF), the reward is learned from data and, therefore, always has some error. It is common to mitigate this by regularizing the policy with KL divergence from a base model, with the hope that balancing reward with regularization will achieve desirable outcomes despite this reward misspecification. We show that when the reward function has light-tailed error, optimal policies under less restrictive KL penalties achieve arbitrarily high utility. However, if error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model, a phenomenon we call catastrophic Goodhart. We adapt a discrete optimization method to measure the tails of reward models, finding that they are consistent with light-tailed error. However, the pervasiveness of heavy-tailed distributions in many real-world applications indicates that future sources of RL reward could have heavy-tailed error, increasing the likelihood of reward hacking even with KL regularization.

Paper Structure (41 sections, 9 theorems, 16 equations, 8 figures, 2 tables)

Introduction
Background
KL divergence and KL regularization
Heavy-tailed distributions
Reward misspecification and Goodhart's Law
Theoretical results
KL divergence on heavy- and light-tailed distributions
RLHF with KL penalty under heavy-tailed return distribution
Light-tailed + independence imply $\mathbb E[V] \to \infty$
Conditioning as alternate model of optimization
Conditioning with heavy-tailed error produces catastrophic Goodhart
Conditioning with light-tailed error produces arbitrarily high utility
Experiments
Results
Discussion and Limitations
...and 26 more sections

Key Result

Theorem 1

Given any heavy-tailed reference distribution $Q$ over $\mathbb R$ with mean $\mu_Q$, and any $M, \epsilon > 0$, there is a distribution $P$ with mean $\mu_P>M$ and $D_{KL}(P \| Q) < \epsilon$.
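A small numerical sketch may help make Theorem 1 concrete. It takes $Q$ to be a Student t-distribution with 3 degrees of freedom and moves probability mass $t^{-0.8}$ onto the region $X > t$, the two choices mentioned in the caption of Figure A.1. The piecewise rescaling of $Q$ used here is an assumption made for illustration and may differ from the construction in the paper's proof; the function names are hypothetical. Because $P_t$ is $Q$ rescaled by one constant above $t$ and another below it, both the mean and the KL divergence have closed forms.

```python
import math

SQRT3 = math.sqrt(3.0)


def t3_upper_tail(t):
    """Q(X > t) for a Student t-distribution with 3 degrees of freedom, t > 0."""
    # Closed form of 1 - CDF, written with atan(sqrt(3)/t) to limit cancellation.
    return (math.atan(SQRT3 / t) - (t / SQRT3) / (1.0 + t * t / 3.0)) / math.pi


def t3_partial_mean_above(t):
    """Integral of x * q(x) over x > t, where q is the t(3) density."""
    return SQRT3 / (math.pi * (1.0 + t * t / 3.0))


def reweighted_mean_and_kl(t, exponent=0.8):
    """Mean of P_t and KL(P_t || Q) when P_t puts mass t**(-exponent) on {X > t}.

    Assumption (illustrative): P_t has density (p/q) * q(x) above t and
    ((1 - p)/(1 - q)) * q(x) below t, i.e. Q rescaled on each side of t.
    """
    if t <= 1.0:
        raise ValueError("this sketch assumes t > 1")
    p = t ** (-exponent)          # mass P_t places on {X > t}
    q = t3_upper_tail(t)          # mass Q places on {X > t}
    a = t3_partial_mean_above(t)  # Q has mean 0, so the part below t integrates to -a
    mean = (p / q) * a - ((1.0 - p) / (1.0 - q)) * a
    kl = p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return mean, kl


if __name__ == "__main__":
    for t in (10.0, 100.0, 1000.0, 10000.0):
        mean, kl = reweighted_mean_and_kl(t)
        print(f"t={t:>8.0f}  mean(P_t)={mean:.3f}  KL(P_t||Q)={kl:.4f}")
```

Under these assumptions the mean grows roughly like $t^{0.2}$ while the KL divergence shrinks roughly like $t^{-0.8}\log t$, so for any target $M$ and $\epsilon$ a large enough $t$ satisfies both conditions of the theorem for this particular $Q$.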

Figures (8)

Figure 1: Plots of the distribution of reward from 30000 random length-1024 token sequences to Starling 7B-alpha. Clockwise from top left: The histogram shows a unimodal distribution with a slight right skew. The normal probability plot indicates the data are heavier-tailed than normal. The Hill estimator (error bars are standard error) appears to be 0.20 for higher values but fluctuates for lower values. The exponential probability plot of the right half of the distribution is consistent with either light or heavy tails (under heavy tails, the slope would go to infinity).
Figure 2: Plots of the reward distribution from 16000 token sequences generated by Llama 7B-chat of length $\le 133$, starting with five random tokens. Clockwise from top left: A histogram shows the reward distribution has a left skew. The normal probability plot suggests reward is approximately normal and thus light-tailed. The Hill estimator plot should stabilize if the distribution is heavy-tailed, but it does not; thus, there is no evidence the distribution is heavy-tailed. The exponential probability plot also indicates light tails, because the curve is bending downwards.
Figure A.1: As $t \to \infty$, the mean of $X$ (blue bar) grows without bound while KL divergence $D_{KL}(P_t \,\|\, Q)$ (orange bar) goes to 0. The base distribution Q is a Student t-distribution with $df=3$. In this case, high values of X are upweighted to $1/t^{0.8}$; upweighting them to $1/t$ would cause $\mathbb E[X]$ to converge to $1$ while KL divergence goes to zero faster.
Figure A.2: A diagram showing the region boundaries at $-h(t)$, $h(t)$, and $t-h(t)$ in an example where $t=25$ and $h(t)=4$, along with a negative log plot of the relevant distribution:
Figure B.1: Histogram and normal probability plot of reward assigned by Pythia RM to random length-1024 token sequences. The Q-Q plot suggests the distribution is approximately normal, which is much lighter-tailed than exponential.
...and 3 more figures
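
The Hill estimator referenced in the captions of Figures 1 and 2 can be sketched in a few lines. This is the textbook form of the estimator, not necessarily the exact procedure used for the figures. It assumes the samples in the upper tail are positive (real reward values may need to be shifted first), and the function names and the synthetic Pareto and exponential data are illustrative only.

```python
import math
import random


def hill_estimator(samples, k):
    """Hill estimate of the tail index xi = 1/alpha from the k largest samples."""
    top = sorted(samples, reverse=True)[: k + 1]
    if k < 1 or len(top) < k + 1:
        raise ValueError("need 1 <= k < len(samples)")
    threshold = top[k]  # the (k+1)-th largest value acts as the tail threshold
    if threshold <= 0:
        raise ValueError("the top k+1 samples must be positive")
    return sum(math.log(x / threshold) for x in top[:k]) / k


def hill_curve(samples, ks):
    """Hill estimates for several choices of k, as plotted in a Hill plot."""
    return [(k, hill_estimator(samples, k)) for k in ks]


if __name__ == "__main__":
    random.seed(0)
    n = 30000
    heavy = [random.paretovariate(5.0) for _ in range(n)]  # Pareto tail, xi = 0.2
    light = [random.expovariate(1.0) for _ in range(n)]    # exponential tail
    for name, data in (("pareto", heavy), ("exponential", light)):
        curve = hill_curve(data, (100, 300, 1000, 3000))
        print(name, [(k, round(h, 3)) for k, h in curve])
```

On the synthetic Pareto data the estimate should stay close to a constant across values of k, which is the stabilizing behavior the Figure 2 caption describes for heavy tails; on exponential data it tends to keep changing with k.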

Theorems & Definitions (12)

Theorem 1
Theorem 2
Theorem 3
Corollary 1
Theorem 4
Theorem 5
Theorem 6
Lemma 1
proof
Lemma 2
...and 2 more
