On the Sensitivity of Reward Inference to Misspecified Human Models

Joey Hong, Kush Bhatia, Anca Dragan

TL;DR

The paper addresses how sensitive reward inference is to misspecified human behavior models, revealing a worst-case instability where tiny model differences can cause large errors, yet offering a positive stability result under mild, natural assumptions that ties reward error linearly to human-model error. It develops a framework based on MLE for reward inference, introduces two policy-distance notions (worst-case and weighted KL divergence), and proves a linear upper bound on reward error under strong log-concavity. The authors instantiate the bound for biases like false internal dynamics and myopia, and validate the theory with both tabular and continuous control experiments, including real human demonstrations. The findings suggest that better human models can meaningfully improve reward inference, while also outlining practical limitations and future directions for robust reward-learning in the presence of human biases.
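The relationship between human-model error and reward error described above can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the paper's experimental setup: the five-state chain MDP, the one-parameter reward, the Boltzmann-rational demonstrator computed by soft value iteration, the uniform state weighting in the KL divergence, and the grid-search MLE are all choices made here for illustration, and every function name is hypothetical. The demonstrator plans with one discount factor while the assumed model uses another (a myopia-style misspecification), and the script reports the weighted KL divergence between the two policies alongside the resulting reward error.

```python
import numpy as np

N_STATES = 5
# Deterministic chain: action 0 moves left, action 1 moves right (clipped at the ends).
NEXT = np.array([[max(s - 1, 0), min(s + 1, N_STATES - 1)] for s in range(N_STATES)])
# Assumption: states are weighted uniformly when averaging over states.
WEIGHTS = np.full(N_STATES, 1.0 / N_STATES)


def reward(theta):
    """Hypothetical reward: fixed 1.0 at the left end, unknown theta at the right end."""
    r = np.zeros(N_STATES)
    r[0] = 1.0
    r[-1] = theta
    return r


def log_boltzmann_policy(theta, gamma, n_iters=300):
    """Log-probabilities of a Boltzmann-rational policy via soft value iteration."""
    r = reward(theta)
    v = np.zeros(N_STATES)
    for _ in range(n_iters):
        q = r[:, None] + gamma * v[NEXT]
        m = q.max(axis=1)
        v = m + np.log(np.exp(q - m[:, None]).sum(axis=1))  # stable log-sum-exp
    q = r[:, None] + gamma * v[NEXT]
    m = q.max(axis=1, keepdims=True)
    return q - m - np.log(np.exp(q - m).sum(axis=1, keepdims=True))


def weighted_kl(log_p, log_q):
    """State-weighted KL divergence between two policies given as log-probabilities."""
    p = np.exp(log_p)
    return float(WEIGHTS @ (p * (log_p - log_q)).sum(axis=1))


def mle_theta(log_pi_true, model_gamma, grid):
    """Grid-search MLE of theta using the expected log-likelihood (infinite-data limit)."""
    p_true = np.exp(log_pi_true)
    scores = [
        float(WEIGHTS @ (p_true * log_boltzmann_policy(t, model_gamma)).sum(axis=1))
        for t in grid
    ]
    return float(grid[int(np.argmax(scores))])


def run(true_theta=2.0, true_gamma=0.9, model_gammas=(0.9, 0.8, 0.6)):
    """Return (model_gamma, policy divergence, reward error) for each assumed discount."""
    grid = np.linspace(0.0, 4.0, 81)
    log_pi_true = log_boltzmann_policy(true_theta, true_gamma)
    results = []
    for g in model_gammas:
        kl = weighted_kl(log_pi_true, log_boltzmann_policy(true_theta, g))
        err = abs(mle_theta(log_pi_true, g, grid) - true_theta)
        results.append((g, kl, err))
    return results


if __name__ == "__main__":
    for g, kl, err in run():
        print(f"model gamma={g:.2f}  weighted KL={kl:.5f}  reward error={err:.3f}")
```

When the assumed discount matches the demonstrator's, the divergence is zero and the grid search recovers the true parameter up to grid resolution. The misspecified cases show how the two quantities move together in this particular toy problem, which is suggestive only and does not stand in for the paper's bound.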

Abstract

Inferring reward functions from human behavior is at the center of value alignment: aligning AI objectives with what we, humans, actually want. But doing so relies on models of how humans behave given their objectives. After decades of research in cognitive science, neuroscience, and behavioral economics, obtaining accurate human models remains an open research topic. This raises the question: how accurate do these models need to be for reward inference to be accurate? On the one hand, if small errors in the model can lead to catastrophic errors in inference, the entire framework of reward learning seems ill-fated, as we will never have perfect models of human behavior. On the other hand, if we can guarantee that reward accuracy improves as our models improve, this would show the benefit of more work on the modeling side. We study this question both theoretically and empirically. We show that it is unfortunately possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward. However, and arguably more importantly, we also identify reasonable assumptions under which the reward inference error can be bounded linearly in the error in the human model. Finally, we verify our theoretical insights in discrete and continuous control tasks with simulated and human data.

Paper Structure (27 sections, 4 theorems, 32 equations, 12 figures)

This paper contains 27 sections, 4 theorems, 32 equations, and 12 figures.

Key Result

Theorem 2

For any MDP $\mathcal{M}$ with continuous actions, policy error $\varepsilon > 0$, assumed model $\widetilde{\pi}$, and dataset $\mathcal{D}$, there exists a demonstrator policy $\pi^*$ that likely generates $\mathcal{D}$ such that the worst-case policy divergence satisfies $d_{\pi}^{\mathsf{wc}}(\p

Figures (12)

  • Figure 1: Effect of transition error (measured as the degree of underestimation of unintended transitions) on (a) weighted policy divergence and (b) reward inference error in three Gridworld environments (A, B, C). In (c), we show a scatter plot of the policy and reward errors for different biased transition models. Note that small policy divergence results in small reward inference error.
  • Figure 2: Effect of discount error in the three Gridworld environments. As in Figure 1, we see a strong correlation between policy divergence and reward inference error.
  • Figure 3: Gridworld environments.
  • Figure 4: Effect of transition error (measured as error in $p$) in the continuous Lunar Lander environments. The results are consistent with earlier Gridworld results.
  • Figure 5: Effect of the simulated human learning bias in the continuous Lunar Lander environments.
  • ...and 7 more figures

Theorems & Definitions (5)

  • Definition 1
  • Theorem 2
  • Theorem 3
  • Corollary 4
  • Corollary 5