Inverse Reward Design
Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, Anca Dragan
TL;DR
This work reframes reward design as an inference problem: the proxy reward a designer specifies is treated as an observation about the designer's true objective, made in the context of a training MDP. The authors cast this problem, inverse reward design (IRD), in a Bayesian framework with an observation model that treats the proxy as approximately optimal for the training MDP, derive approximations to the resulting IRD posterior, and couple them with risk-averse planning. Empirical results in the Lavaland domain suggest that IRD reduces negative side effects and reward hacking under misspecification, even when rewards are latent or only available through high-dimensional observations. The approach advances value alignment by letting agents reason about uncertainty in the reward they were given and hedge against unknown risks in novel environments.
Abstract
Autonomous agents optimize the reward function we give them. What they don't know is how hard it is for us to design a reward function that actually captures what we want. When designing the reward, we might think of some specific training scenarios, and make sure that the reward will lead to the right behavior in those scenarios. Inevitably, agents encounter new scenarios (e.g., new types of terrain) where optimizing that same reward may lead to undesired behavior. Our insight is that reward functions are merely observations about what the designer actually wants, and that they should be interpreted in the context in which they were designed. We introduce inverse reward design (IRD) as the problem of inferring the true objective based on the designed reward and the training MDP. We introduce approximate methods for solving IRD problems, and use their solution to plan risk-averse behavior in test MDPs. Empirical results suggest that this approach can help alleviate negative side effects of misspecified reward functions and mitigate reward hacking.
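To make the idea concrete, here is a minimal Python sketch of inferring a posterior over true rewards from a proxy and then planning against the worst plausible reward. It is an illustration under stated assumptions, not the authors' implementation: rewards are assumed linear in path feature counts, the true reward is assumed to lie in a small finite grid that doubles as the proxy space, the prior is uniform, and a proxy is assumed likely in proportion to exp(beta × the true reward of the behavior it induces in training). The function names, the toy feature counts, and the `ratio` cutoff are all hypothetical.

```python
import itertools
import numpy as np

def induced_features(w, feats):
    """Feature counts of the path that is optimal for reward weights w."""
    return feats[np.argmax(feats @ w)]

def ird_posterior(proxy_idx, candidates, train_feats, beta=1.0):
    """Posterior over candidate true weights given one observed proxy.

    Assumes a uniform prior and that the proxy space equals `candidates`.
    """
    # Path features each possible proxy would induce in the training environment.
    induced = np.array([induced_features(w, train_feats) for w in candidates])
    # scores[i, j]: true reward w_i collected by the behavior that proxy w_j induces.
    scores = beta * (candidates @ induced.T)
    # Normalize over proxies j (stable log-sum-exp) to get log P(proxy_j | w_i).
    top = scores.max(axis=1, keepdims=True)
    log_z = top + np.log(np.exp(scores - top).sum(axis=1, keepdims=True))
    log_lik = (scores - log_z)[:, proxy_idx]
    post = np.exp(log_lik - log_lik.max())
    return post / post.sum()

def risk_averse_plan(post, candidates, test_feats, ratio=0.1):
    """Index of the test path with the best worst-case reward over plausible weights."""
    # Hypothetical cutoff: keep weights whose posterior is within `ratio` of the maximum.
    plausible = candidates[post >= ratio * post.max()]
    worst_case = (test_feats @ plausible.T).min(axis=1)
    return int(np.argmax(worst_case))

# Features: cells crossed of each type (dirt, grass, lava); weights are per-cell rewards.
candidates = np.array(list(itertools.product([-2.0, -1.0, 0.0], repeat=3)))
proxy = np.array([-1.0, -2.0, 0.0])  # lava never appeared in training; its weight is arbitrary
proxy_idx = int(np.flatnonzero((candidates == proxy).all(axis=1))[0])

train_feats = np.array([[4.0, 0.0, 0.0],   # stays on dirt
                        [1.0, 2.0, 0.0]])  # crosses grass; no lava in training
test_feats = np.array([[1.0, 0.0, 4.0],    # crosses lava
                       [3.0, 0.0, 0.0],    # stays on dirt
                       [1.0, 3.0, 0.0]])   # crosses grass

post = ird_posterior(proxy_idx, candidates, train_feats)
naive_choice = int(np.argmax(test_feats @ proxy))                # optimize the proxy literally
careful_choice = risk_averse_plan(post, candidates, test_feats)  # hedge over the posterior
print(naive_choice, careful_choice)  # → 0 1
```

Because no training path contains lava, the proxy carries no information about the lava weight, and the posterior is identical across all lava values. Optimizing the proxy literally therefore picks the lava path (index 0), while the worst-case planner picks the dirt path (index 1). The maximin rule over a thresholded candidate set is a simplification chosen for brevity; other risk-averse criteria would fit the same structure.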
