When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback

Leon Lang, Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, Scott Emmons

TL;DR

Modeling the human as Boltzmann-rational w.r.t. a belief over trajectories, the authors prove conditions under which RLHF is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both.

Abstract

Past analyses of reinforcement learning from human feedback (RLHF) assume that the human evaluators fully observe the environment. What happens when human feedback is based only on partial observations? We formally define two failure cases: deceptive inflation and overjustification. Modeling the human as Boltzmann-rational w.r.t. a belief over trajectories, we prove conditions under which RLHF is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both. Under the new assumption that the human's partial observability is known and accounted for, we then analyze how much information the feedback process provides about the return function. We show that sometimes, the human's feedback determines the return function uniquely up to an additive constant, but in other realistic cases, there is irreducible ambiguity. We propose exploratory research directions to help tackle these challenges, experimentally validate both the theoretical concerns and potential mitigations, and caution against blindly applying RLHF in partially observable settings.
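
The human model named in the abstract can be illustrated with a minimal sketch. It assumes that the human forms a belief over full trajectories given each observation, scores an observation by its expected return under that belief, and prefers one observation over another with probability given by a sigmoid of the scaled difference. The trajectory labels, return values, belief weights, and the `beta` parameter below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def expected_return(belief, returns):
    """Expected return under the human's belief over full trajectories."""
    return sum(prob * returns[traj] for traj, prob in belief.items())


def choice_probability(belief_a, belief_b, returns, beta=1.0):
    """Probability that a Boltzmann-rational human prefers observation A over B.

    Assumed form: sigmoid(beta * (E[G | obs A] - E[G | obs B])).
    """
    diff = expected_return(belief_a, returns) - expected_return(belief_b, returns)
    return 1.0 / (1.0 + math.exp(-beta * diff))


# Hypothetical true returns of three underlying trajectories.
returns = {"success": 1.0, "visible_error": -1.0, "hidden_error": -2.0}

# Hypothetical beliefs: an error log is unambiguous, while a clean log
# could stem from a real success or from an error that was hidden.
belief_error_log = {"visible_error": 1.0}
belief_clean_log = {"success": 0.9, "hidden_error": 0.1}

p_prefers_clean = choice_probability(belief_clean_log, belief_error_log, returns)
print(f"P(human prefers clean log) = {p_prefers_clean:.3f}")
```

In this toy setup the human prefers the clean log with probability above one half, even when the clean log was in fact produced by the hidden error, whose true return is the lowest of the three. That gap between believed and true return is the kind of incentive the abstract warns about.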

Paper Structure

This paper contains 43 sections, 29 theorems, 125 equations, 9 figures, 6 tables.

Key Result

Proposition 3.1

Let $R$ be the true reward function and $G$ the corresponding return function. Then the collection of all choice probabilities $P^{R}(\vec{s} \succ \vec{s}')$ for state sequence pairs $\vec{s}, \vec{s}' \in \vec{\mathcal{S}}$ determines the return function $G$ on sequences $\vec{s} \in \dots$
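
As a hedged illustration of why choice probabilities carry information about returns, the sketch below assumes the full-observability Boltzmann form $P(\vec{s} \succ \vec{s}') = \sigma(\beta\,(G(\vec{s}) - G(\vec{s}')))$. This form, the value of `beta`, and the three sequence labels and returns are assumptions made for this example, not quoted from the proposition. Under that assumption, applying the logit to each probability recovers the return difference to a reference sequence, so the returns are pinned down relative to one another.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


beta = 1.0  # assumed rationality coefficient
G = {"a": 2.0, "b": 0.5, "c": -1.0}  # hypothetical returns of three sequences

# Choice probabilities for every ordered pair under the assumed model.
probs = {(s, t): sigmoid(beta * (G[s] - G[t])) for s in G for t in G}

# Invert the model: return of each sequence relative to reference "a".
recovered = {s: logit(probs[(s, "a")]) / beta for s in G}

for s in G:
    print(s, round(recovered[s], 6), G[s] - G["a"])
```

The recovered values match `G[s] - G["a"]` for every sequence, which shows that pairwise probabilities alone cannot reveal a shift applied equally to all returns.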

Figures (9)

  • Figure 1: Partial observability in ChatGPT \cite{Plugins_2023}. Users do not observe the online content that ChatGPT observes, yet they still provide thumbs-up/thumbs-down feedback. OpenAI's privacy policy \cite{privacy_policy_oai} allows user feedback to be used for training models. We show in \ref{thm:rlhf_deceptive_overjustification} that if the feedback of human evaluators is based on partial observations, then this can lead to deceptive and overjustifying behavior by the language model.
  • Figure 2: A human compares trajectories to provide data for RLHF. Rather than observing $\vec{s}$ and $\vec{s}'$, the human sees observations $\vec{o}$ and $\vec{o}'$, which they use to estimate the total reward of each trajectory. In this intentionally simple example, an agent executes shell commands to install Nvidia drivers and CUDA. Both $\vec{s}$ and $\vec{s}'$ contain an error, but in $\vec{s}'$, the agent hides the error. The human believes $\vec{s}'$ is better than $\vec{s}$, rewarding the agent's deceptive behavior. The underlying MDP and observation function are in \ref{fig:expanded_figure2a}.
  • Figure 3: Behaviors defined by increasing and decreasing the human's over- and underestimation error. RLHF with partial observations results in incentives to increase overestimation error and decrease underestimation error (\ref{thm:rlhf_deceptive_overjustification}).
  • Figure 4: Scenarios illustrating failure modes due to partial observability. In each, the agent must install two packages. Formal details of the underlying MDPs are provided in \ref{sec:concrete_failures_math}. A, top: In the absence of a log message about CUDA, the human is unsure whether the agent skipped it or used the 2> /dev/null trick (see \ref{fig:figure1}); if the human is insufficiently skeptical, the trick looks optimal to the agent. B, bottom: Default logging in this case is silent when the NumPy install is successful. The agent can optionally use a --verbose flag, but this produces a long log that the human prefers not to see. If the human is too skeptical, verbose logging still appears optimal to the agent.
  • Figure 5: Example A: The larger the reward penalty for hiding errors with 2> /dev/null, and the larger the human's belief that the agent used 2> /dev/null upon seeing an empty log ($p_{\text{hide}}$), the more we expect the agent to install CUDA with default logging in Example A. In \ref{ex:should-attempt-cuda-app}, we compute a precise theoretical threshold where the behavior should switch. This perfectly agrees with empirical findings. Example B: The larger the reward penalty for verbose logging, and the larger the human's trust that the agent installed NumPy upon seeing an empty log ($p_{\text{default}}$), the more we expect the agent to skip the NumPy installation entirely. In \ref{ex:verbose-numpy-app}, we compute a precise theoretical threshold where the behavior should switch. Except for four cases of "verbose logging" where the theory predicted that the agent would skip the NumPy installation, this agrees with empirical findings. See \ref{sec:experimental_details} for experimental details.
  • ...and 4 more figures

Theorems & Definitions (79)

  • Proposition 3.1: \cite{Skalse2022Invariance}
  • Proposition 4.1
  • Definition 4.2: Overestimation and Underestimation Error
  • Definition 4.3: Deceptive Inflation
  • Definition 4.4: Overjustification
  • Theorem 4.5
  • Definition 5.1
  • Theorem 5.2
  • Definition 5.3: Ambiguity
  • Theorem 5.4
  • ...and 69 more