Goodhart's Law in Reinforcement Learning

Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, Joar Skalse

TL;DR

This work uses Goodhart's law to study how reinforcement learning agents trained on an imperfect proxy reward can end up performing poorly on the true objective. It introduces a geometric, occupancy-based framework that recasts policy optimization as a linear program over a convex polytope of occupancy measures. Within this framework it defines a distance between rewards via the projected angle arg(R0, R1), and a Normalised Drop Height metric to quantify Goodharting. The authors give a mechanistic explanation for Goodharting in RL, show that it occurs across diverse environments, and propose two provably robust policy optimization strategies to avoid it, including an optimal stopping rule with regret guarantees. Experiments show that early stopping can prevent Goodharting across many setups, though potentially at some cost to true-objective performance. The authors also discuss practical considerations for estimating key quantities and for extending the framework to reward refinement and broader failure modes.

Abstract

Implementing a reward function that perfectly captures a complex task in the real world is impractical. As a result, it is often appropriate to think of the reward function as a proxy for the true objective rather than as its definition. We study this phenomenon through the lens of Goodhart's law, which predicts that increasing optimisation of an imperfect proxy beyond some critical point decreases performance on the true objective. First, we propose a way to quantify the magnitude of this effect and show empirically that optimising an imperfect proxy reward often leads to the behaviour predicted by Goodhart's law for a wide range of environments and reward functions. We then provide a geometric explanation for why Goodhart's law occurs in Markov decision processes. We use these theoretical insights to propose an optimal early stopping method that provably avoids the aforementioned pitfall and derive theoretical regret bounds for this method. Moreover, we derive a training method that maximises worst-case reward, for the setting where there is uncertainty about the true reward function. Finally, we evaluate our early stopping method experimentally. Our results support a foundation for a theoretically-principled study of reinforcement learning under reward misspecification.

Paper Structure

This paper contains 33 sections, 15 theorems, 24 equations, 27 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

The set $\Omega = \{\mathbf{\eta^{\pi}}: \pi \in \Pi\}$ is the convex hull of the finite set of points corresponding to the deterministic policies $\{\mathbf{\eta^{\pi}}: \pi \in \Pi_0\}$. It lies in an affine subspace of dimension $|S|(|A| - 1)$.
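Proposition 1 can be checked numerically on a small example. The sketch below is not taken from the paper: it assumes the unnormalised discounted definition $\eta^{\pi}(s,a) = \sum_t \gamma^t \Pr(s_t = s, a_t = a)$, a randomly generated MDP with $|S| = 3$ and $|A| = 2$ in which every state is reachable, and hypothetical helper names (`occupancy_measure`, `deterministic_policies`). It enumerates the deterministic policies, computes their occupancy measures, and checks that these points span an affine subspace of dimension $|S|(|A| - 1)$. It only tests that a stochastic policy stays in the same affine subspace, which is a necessary condition for lying in the convex hull rather than a full membership test.

```python
import itertools
import numpy as np

def occupancy_measure(P, mu0, policy, gamma):
    """Discounted occupancy eta[s, a] = sum_t gamma^t Pr(s_t = s, a_t = a).

    P: (S, A, S) transition tensor, mu0: (S,) initial distribution,
    policy: (S, A) row-stochastic matrix. Unnormalised (assumed convention).
    """
    S = P.shape[0]
    # State-to-state transition matrix induced by the policy.
    P_pi = np.einsum("sa,sat->st", policy, P)
    # Discounted state visitation d solves d = mu0 + gamma * P_pi^T d.
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu0)
    return d[:, None] * policy

def deterministic_policies(S, A):
    """Yield every deterministic policy as a one-hot (S, A) matrix."""
    for actions in itertools.product(range(A), repeat=S):
        pi = np.zeros((S, A))
        pi[np.arange(S), list(actions)] = 1.0
        yield pi

# Illustrative random MDP (an assumption, not an environment from the paper).
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S)
mu0 = rng.dirichlet(np.ones(S))             # full-support start distribution

# Occupancy measures of the |A|^|S| deterministic policies, flattened.
vertices = np.array([occupancy_measure(P, mu0, pi, gamma).ravel()
                     for pi in deterministic_policies(S, A)])

# Affine dimension = rank of the differences from one reference vertex.
diffs = vertices[1:] - vertices[0]
affine_dim = np.linalg.matrix_rank(diffs, tol=1e-9)

# A random stochastic policy should not leave that affine subspace.
pi_random = rng.dirichlet(np.ones(A), size=S)
eta_random = occupancy_measure(P, mu0, pi_random, gamma).ravel()
dim_with_random = np.linalg.matrix_rank(
    np.vstack([diffs, eta_random - vertices[0]]), tol=1e-9)

# The return is linear in eta: J(pi) = <eta^pi, R> for a reward vector R.
R = rng.normal(size=S * A)
returns = vertices @ R

print("affine dimension:", affine_dim, "expected:", S * (A - 1))
```

Because the return of a policy is the inner product of its occupancy measure with the reward vector, maximising return over $\Omega$ is a linear program, and `returns.argmax()` picks out an optimal deterministic policy for the sampled `R`.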

Figures (27)

  • Figure 1: A cartoon of Goodharting.
  • Figure 2:
  • Figure 3: Visualisation of Goodhart's law in case of $\mathcal{M}_{2, 2}$.
  • Figure 4: Early stopping algorithm and its behaviour.
  • Figure 5: (a) Reward lost due to the early stopping ($\diamond$ show groups' medians). (b) The relationship between $\theta$ and the lost reward (shaded area between 25th-75th quantiles), aggregated into 25 buckets.
  • ...and 22 more figures

Theorems & Definitions (27)

  • Definition 1: State-action occupancy measure
  • Proposition 1
  • Definition 2: Projected angle
  • Proposition 2
  • Definition 3: Maximal Causal Entropy
  • Definition 4: Boltzmann Rationality
  • Definition 5: Normalised drop height
  • Proposition 3: Concavity of Steepest Ascent
  • Theorem 1
  • Corollary 1: Optimal Stopping
  • ...and 17 more