Inverse Reward Design

Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, Anca Dragan

TL;DR

This work reframes reward design as an inference problem: the proxy reward is an observation about the designer's true objective within a training MDP. By casting IRD in a Bayesian framework and using an observation model that treats proxies as approximately optimal demonstrations, the authors derive scalable approximations to the IRD posterior and couple them with risk-averse planning. Empirical results in the Lavaland domain show that IRD reduces negative side effects and reward hacking under misspecification, even when rewards are latent or only available through high-dimensional observations. The approach advances value alignment by enabling agents to reason about uncertainty in reward evaluations and to hedge against unknown risks in novel environments.

Abstract

Autonomous agents optimize the reward function we give them. What they don't know is how hard it is for us to design a reward function that actually captures what we want. When designing the reward, we might think of some specific training scenarios, and make sure that the reward will lead to the right behavior in those scenarios. Inevitably, agents encounter new scenarios (e.g., new types of terrain) where optimizing that same reward may lead to undesired behavior. Our insight is that reward functions are merely observations about what the designer actually wants, and that they should be interpreted in the context in which they were designed. We introduce inverse reward design (IRD) as the problem of inferring the true objective based on the designed reward and the training MDP. We introduce approximate methods for solving IRD problems, and use their solution to plan risk-averse behavior in test MDPs. Empirical results suggest that this approach can help alleviate negative side effects of misspecified reward functions and mitigate reward hacking.
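The inference described in the abstract can be illustrated on a toy problem. The sketch below is a minimal illustration, not the authors' implementation. It assumes an observation model in which a proxy is more likely when the behavior it induces in the training MDP scores well under the true reward, and it approximates the normalizer with a small hand-picked set of candidate proxies. The feature layout (target, dirt, grass, lava), the trajectories, the candidate weights, and the function names are all hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def best_trajectory_features(w, trajectories):
    """Feature counts of the trajectory a literal planner picks under weights w."""
    return max(trajectories, key=lambda phi: dot(w, phi))

def ird_posterior(proxy, candidate_true, candidate_proxies, trajectories, beta=1.0):
    """Posterior over candidate true weights given the proxy (uniform prior assumed)."""
    unnormalized = []
    for w_true in candidate_true:
        # Score every candidate proxy by the true reward of the behavior it induces.
        scores = [beta * dot(w_true, best_trajectory_features(p, trajectories))
                  for p in candidate_proxies]
        log_z = math.log(sum(math.exp(s) for s in scores))
        observed = beta * dot(w_true, best_trajectory_features(proxy, trajectories))
        unnormalized.append(math.exp(observed - log_z))
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical training MDP: feature counts are (target, dirt, grass, lava).
# No training trajectory ever touches lava.
trajectories = [
    [1.0, 4.0, 0.0, 0.0],  # dirt path to the target
    [1.0, 0.0, 3.0, 0.0],  # grass shortcut to the target
    [0.0, 2.0, 0.0, 0.0],  # wanders on dirt, never reaches the target
]
proxy = [5.0, 0.0, -1.0, 0.0]
candidate_proxies = [proxy, [0.0, 0.0, 1.0, 0.0], [-5.0, 1.0, 0.0, 0.0]]
candidate_true = [
    [5.0, 0.0, -1.0, 0.0],    # proxy taken literally
    [5.0, 0.0, -1.0, -10.0],  # same, but lava is very bad
    [-5.0, 0.0, 1.0, 0.0],    # disagrees with the proxy on seen features
]
posterior = ird_posterior(proxy, candidate_true, candidate_proxies, trajectories)
```

In this toy setup the first two candidates receive equal posterior mass because lava never appears in training, so the proxy carries no information about the lava weight. The third candidate, which contradicts the proxy on features that were present, is assigned very little mass.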

Paper Structure

This paper contains 12 sections, 1 theorem, 8 equations, 5 figures.

Key Result

Proposition 1

The posterior distribution that the IRD model induces on ${w^*}$ (i.e., Equation eq-irdp) and the posterior distribution induced by IRL (i.e., Equation eq-irl) are invariant to linear translations of the features in the training MDP.
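A quick numerical check gives some intuition for the invariance. The sketch below assumes a maximum-entropy trajectory model, where a trajectory's probability is proportional to the exponential of its reward, over a small finite set of equal-length trajectories. The equations referenced in the proposition are not reproduced in this overview, so the model, the numbers, and the function names are illustrative assumptions. Adding the same offset to every per-step feature vector shifts each trajectory's score by the same amount, which cancels in the normalization.

```python
import math

def maxent_trajectory_probs(w, trajectory_features):
    """Softmax over trajectories with score w . phi(xi) (assumed max-entropy model)."""
    scores = [sum(wi * fi for wi, fi in zip(w, phi)) for phi in trajectory_features]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def translate(trajectory_features, offset, horizon):
    """Shift every per-step feature by `offset`; trajectory totals move by horizon * offset."""
    return [[f + horizon * c for f, c in zip(phi, offset)] for phi in trajectory_features]

# Hypothetical weights and summed feature counts for three trajectories of length 5.
w = [1.0, -0.5, 2.0]
features = [[3.0, 1.0, 0.0], [2.0, 2.0, 1.0], [0.0, 4.0, 1.0]]
shifted = translate(features, offset=[0.7, -1.2, 0.3], horizon=5)

p_original = maxent_trajectory_probs(w, features)
p_shifted = maxent_trajectory_probs(w, shifted)
max_gap = max(abs(a - b) for a, b in zip(p_original, p_shifted))
```

Here `max_gap` is zero up to floating-point error. Since the likelihood is unchanged for every weight vector, any posterior built from it is unchanged as well.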

Figures (5)

  • Figure 1: An illustration of a negative side effect. Alice designs a reward function so that her robot navigates to the pot of gold and prefers dirt paths. She does not consider that her robot might encounter lava in the real world and leaves that out of her reward specification. The robot maximizing this proxy reward function drives through the lava to its demise. In this work, we formalize the (Bayesian) inverse reward design (IRD) problem as the problem of inferring (a distribution on) the true reward function from the proxy. We show that IRD can help mitigate unintended consequences from misspecified reward functions like negative side effects and reward hacking.
  • Figure 2: An example from the Lavaland domain. Left: The training MDP where the designer specifies a proxy reward function. This incentivizes movement toward targets (yellow) while preferring dirt (brown) to grass (green), and generates the gray trajectory. Middle: The testing MDP has lava (red). The proxy does not penalize lava, so optimizing it makes the agent go straight through (gray). This is a negative side effect, which the IRD agent avoids (blue): it treats the proxy as an observation in the context of the training MDP, which makes it realize that it cannot trust the (implicit) weight on lava. Right: The testing MDP has cells in which two sensor indicators no longer correlate: they look like grass to one sensor but target to the other. The proxy puts weight on the first, so the literal agent goes to these cells (gray). The IRD agent knows that it can't trust the distinction and goes to the target on which both sensors agree (blue).
  • Figure 3: Our challenge domain with latent rewards. Each terrain type (grass, dirt, target, lava) induces a different distribution over high-dimensional features: $\phi_{s} \sim \mathcal{N}(\mu_{I_{s}}, \Sigma_{I_{s}})$. The designer never builds an indicator for lava, and yet the agent still needs to avoid it in the test MDPs.
  • Figure 4: The results of our experiment comparing our proposed method to a baseline that directly plans with the proxy reward function. By solving an inverse reward design problem, we are able to create generic incentives to avoid unseen or novel states.
  • Figure 5: Left: We avoid side effects and reward hacking by computing a posterior distribution over reward functions and then finding a trajectory that performs well under the worst-case reward function. This panel illustrates the impact of selecting the worst case independently per time step or once for the entire trajectory. Taking the minimum per time step increases robustness to the approximate inference algorithms used, because we only need one particle in our sampled posterior to capture the worst case for each grid cell type. For the full trajectory, we need a single particle to have inferred a worst case for every grid cell type at once. Right: The impact of changing the offsets $c_i$. "Initial State" fixes the value of the start state to be 0. "Training Feature Counts" sets an average feature value from the training MDP to be 0. "Log Z(w)" offsets each evaluation by the log normalizing constant of the maximum entropy trajectory distribution, so that the sum of rewards across a trajectory is the log probability of that trajectory.
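The two worst-case criteria compared in Figure 5 (left) can be sketched in a few lines. This is a minimal sketch assuming linear rewards and a posterior represented by a handful of sampled weight vectors. The two-feature layout (dirt, lava), the sample values, and the function names are hypothetical, and the offsets $c_i$ are omitted.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def trajectory_worst_case(weight_samples, step_features):
    """Pick one worst-case sample for the whole trajectory: min_w sum_t w . phi_t."""
    return min(
        sum(dot(w, phi) for phi in step_features) for w in weight_samples
    )

def per_step_worst_case(weight_samples, step_features):
    """Pick the worst-case sample separately at each step: sum_t min_w w . phi_t."""
    return sum(
        min(dot(w, phi) for w in weight_samples) for phi in step_features
    )

# Hypothetical posterior samples over (dirt, lava) weights.
# One particle is pessimistic about dirt, the other about lava;
# no single particle is pessimistic about both.
weight_samples = [[-1.0, 0.0], [0.0, -5.0]]

# A two-step trajectory: one dirt cell, then one lava cell.
step_features = [[1.0, 0.0], [0.0, 1.0]]

whole = trajectory_worst_case(weight_samples, step_features)
per_step = per_step_worst_case(weight_samples, step_features)
```

With these numbers `whole` is -5.0 and `per_step` is -6.0. The per-step criterion combines the pessimism of both particles, which matches the caption's point that it only needs some particle to capture the worst case for each cell type. The per-step value can never exceed the whole-trajectory value, so it is the more conservative of the two.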

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Definition 3
  • Proposition 1
  • proof