Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Thomas Kwa, Drake Thomas, Adrià Garriga-Alonso

TL;DR

The authors adapt a discrete optimization method to measure the tails of reward models and find them consistent with light-tailed error. However, heavy-tailed distributions are pervasive in real-world applications, so future sources of RL reward could have heavy-tailed error, which would make reward hacking more likely even with KL regularization.

Abstract

When applying reinforcement learning from human feedback (RLHF), the reward is learned from data and, therefore, always has some error. It is common to mitigate this by regularizing the policy with KL divergence from a base model, with the hope that balancing reward with regularization will achieve desirable outcomes despite this reward misspecification. We show that when the reward function has light-tailed error, optimal policies under less restrictive KL penalties achieve arbitrarily high utility. However, if error is heavy-tailed, some policies obtain arbitrarily high reward despite achieving no more utility than the base model, a phenomenon we call catastrophic Goodhart. We adapt a discrete optimization method to measure the tails of reward models, finding that they are consistent with light-tailed error. However, the pervasiveness of heavy-tailed distributions in many real-world applications indicates that future sources of RL reward could have heavy-tailed error, increasing the likelihood of reward hacking even with KL regularization.
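
For orientation, the KL-regularized objective the abstract refers to is commonly written as below. The notation is an assumption made here for illustration and is not taken from the paper: $\pi_0$ is the base model, $\hat r$ is the learned reward, and $\beta > 0$ is the penalty strength (a smaller $\beta$ is a less restrictive penalty).

$$
\max_{\pi}\; \mathbb{E}_{x \sim \pi}\big[\hat r(x)\big] \;-\; \beta\, D_{KL}(\pi \,\|\, \pi_0)
$$

When the normalizing constant $\mathbb{E}_{x \sim \pi_0}\big[\exp(\hat r(x)/\beta)\big]$ is finite, the maximizer has the standard closed form

$$
\pi^*(x) \;\propto\; \pi_0(x)\, \exp\!\big(\hat r(x)/\beta\big).
$$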

Paper Structure

This paper contains 41 sections, 9 theorems, 16 equations, 8 figures, and 2 tables.

Key Result

Theorem 1

Given any heavy-tailed reference distribution $Q$ over $\mathbb R$ with mean $\mu_Q$, and any $M, \epsilon > 0$, there is a distribution $P$ with mean $\mu_P>M$ and $D_{KL}(P \| Q) < \epsilon$.
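
A small numerical sketch can make Theorem 1 concrete. It is an illustration under stated assumptions, not the paper's proof: $Q$ is taken to be a Pareto distribution with tail index 3 and minimum 1 (chosen because its tail probability and conditional tail mean have closed forms), and $P_t$ is the mixture $(1-p)\,Q + p\,Q(\cdot \mid X > t)$ with $p = t^{-0.8}$, loosely echoing the $1/t^{0.8}$ upweighting mentioned in the caption of Figure A.1. All function names are hypothetical.

```python
import math

ALPHA = 3.0  # assumed Pareto tail index for Q (minimum value x_m = 1)


def pareto_tail_prob(t):
    """Q(X > t) for t >= 1."""
    return t ** (-ALPHA)


def pareto_mean():
    """E_Q[X]."""
    return ALPHA / (ALPHA - 1.0)


def pareto_tail_mean(t):
    """E_Q[X | X > t] for t >= 1."""
    return ALPHA * t / (ALPHA - 1.0)


def tilted_mean_and_kl(t, exponent=0.8):
    """Mean of P_t and D_KL(P_t || Q) for P_t = (1-p) Q + p Q(. | X > t).

    Requires t > 1 so that the mixing weight p = t**(-exponent) is below 1.
    """
    p = t ** (-exponent)
    q_t = pareto_tail_prob(t)

    mean_p = (1.0 - p) * pareto_mean() + p * pareto_tail_mean(t)

    # dP/dQ is piecewise constant: (1-p) on {x <= t}, (1-p) + p/q_t on {x > t}.
    mass_body = (1.0 - p) * (1.0 - q_t)   # P_t(X <= t)
    mass_tail = (1.0 - p) * q_t + p       # P_t(X > t)
    log_ratio_body = math.log1p(-p)
    log_ratio_tail = math.log((1.0 - p) + p / q_t)

    kl = mass_body * log_ratio_body + mass_tail * log_ratio_tail
    return mean_p, kl


for t in (1e1, 1e3, 1e6, 1e12):
    mean_p, kl = tilted_mean_and_kl(t)
    print(f"t={t:.0e}  mean={mean_p:.3f}  KL={kl:.3e}")
```

As $t$ grows, the printed mean increases without bound while the KL divergence shrinks toward zero, which is the behavior the theorem asserts for heavy-tailed $Q$.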

Figures (8)

  • Figure 1: Plots of the distribution of reward from 30000 random length-1024 token sequences to Starling 7B-alpha. Clockwise from top left: The histogram shows a unimodal distribution with a slight right skew. The normal probability plot indicates the data are heavier-tailed than normal. The Hill estimator (error bars are standard error) appears to be 0.20 for higher values but fluctuates for lower values. The exponential probability plot of the right half of the distribution is consistent with either light or heavy tails (under heavy tails, the slope would go to infinity).
  • Figure 2: Plots of the reward distribution from 16000 token sequences generated by Llama 7B-chat of length $\le 133$, starting with five random tokens. Clockwise from top left: A histogram shows the reward distribution has a left skew. The normal probability plot suggests reward is approximately normal and thus light-tailed. The Hill estimator plot should stabilize if the distribution is heavy-tailed, but it does not; thus, there is no evidence the distribution is heavy-tailed. The exponential probability plot also indicates light tails, because the curve is bending downwards.
  • Figure A.1: As $t \to \infty$, the mean of $X$ (blue bar) grows without bound while KL divergence $D_{KL}(P_t \,\|\, Q)$ (orange bar) goes to 0. The base distribution Q is a Student t-distribution with $df=3$. In this case, high values of X are upweighted to $1/t^{0.8}$; upweighting them to $1/t$ would cause $\mathbb E[X]$ to converge to $1$ while KL divergence goes to zero faster.
  • Figure A.2: A diagram showing the region boundaries at $-h(t)$, $h(t)$, and $t-h(t)$ in an example where $t=25$ and $h(t)=4$, along with a negative log plot of the relevant distribution.
  • Figure B.1: Histogram and normal probability plot of reward assigned by Pythia RM to random length-1024 token sequences. The Q-Q plot suggests the distribution is approximately normal, which is much lighter-tailed than exponential.
  • ...and 3 more figures
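
The captions for Figures 1 and 2 rely on the Hill estimator. The sketch below shows the textbook form of that estimator, not necessarily the exact procedure used in the paper. It assumes positive samples in the upper tail (reward values would need to be shifted or restricted to the right half of the distribution first), and the function name and the synthetic Pareto and exponential samples are illustrative choices.

```python
import math
import random


def hill_estimator(samples, k):
    """Textbook Hill estimate of the tail index from the k largest samples.

    Returns (estimate, approximate standard error). For a Pareto-like tail
    with exponent alpha, the estimate approaches 1/alpha.
    """
    ordered = sorted(samples, reverse=True)
    if not 1 <= k < len(ordered):
        raise ValueError("k must satisfy 1 <= k < len(samples)")
    threshold = ordered[k]  # the (k+1)-th largest sample
    if threshold <= 0:
        raise ValueError("the (k+1)-th largest sample must be positive")
    gamma = sum(math.log(x / threshold) for x in ordered[:k]) / k
    return gamma, gamma / math.sqrt(k)


rng = random.Random(0)
heavy = [rng.paretovariate(3.0) for _ in range(50_000)]  # heavy-tailed
light = [rng.expovariate(1.0) for _ in range(50_000)]    # light-tailed

for k in (50, 500, 5000):
    g_heavy, se_heavy = hill_estimator(heavy, k)
    g_light, se_light = hill_estimator(light, k)
    print(f"k={k:5d}  heavy={g_heavy:.3f}±{se_heavy:.3f}  "
          f"light={g_light:.3f}±{se_light:.3f}")
```

For the Pareto samples the estimate stays near $1/3$ across values of $k$, while for the exponential samples it keeps drifting downward as $k$ shrinks. That failure to stabilize is the pattern the Figure 2 caption reads as a lack of evidence for heavy tails.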

Theorems & Definitions (12)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Corollary 1
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Lemma 1
  • proof
  • Lemma 2
  • ...and 2 more