Reward-rational (implicit) choice: A unifying formalism for reward learning

Hong Jun Jeon, Smitha Milli, Anca D. Dragan

TL;DR

The paper tackles the challenge of hand-specifying reward functions for intelligent agents by introducing reward-rational implicit choice as a unifying formalism. It models human feedback as choices from an implicit or explicit option set, grounded to robot trajectories, and assumes Boltzmann-rational selection to connect observed behavior to an underlying reward. This framework provides a Bayesian-inspired mechanism to infer rewards from a variety of feedback types and shows how disparate methods (comparisons, demonstrations, corrections, language, etc.) fit under a common lens. It also explores implications for combining feedback types and introducing meta-choice, where the choice of feedback type itself leaks information about the true reward, guiding future multi-type reward-learning research.
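The inference mechanism summarized above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a finite set of candidate rewards, a deterministic grounding (each option maps to exactly one trajectory), and hypothetical names (`boltzmann_likelihood`, `posterior_over_rewards`, `grounded_rewards`) chosen for this example only.

```python
import numpy as np

def boltzmann_likelihood(rewards, beta):
    """P(c | r) proportional to exp(beta * r(psi(c))) over the option set."""
    logits = beta * np.asarray(rewards, dtype=float)
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def posterior_over_rewards(prior, grounded_rewards, chosen, beta):
    """Bayes update over a finite set of candidate rewards.

    grounded_rewards[k][i] is candidate k's reward for the trajectory
    that option i is grounded to (deterministic grounding is ASSUMED).
    """
    prior = np.asarray(prior, dtype=float)
    likelihood = np.array(
        [boltzmann_likelihood(row, beta)[chosen] for row in grounded_rewards]
    )
    unnormalised = prior * likelihood
    return unnormalised / unnormalised.sum()

# Toy comparison: two options, two candidate rewards that disagree on them.
grounded_rewards = [[1.0, 0.0],   # candidate 0 prefers option 0
                    [0.0, 1.0]]   # candidate 1 prefers option 1
posterior = posterior_over_rewards([0.5, 0.5], grounded_rewards, chosen=0, beta=2.0)
```

Observing the human pick option 0 shifts belief toward the candidate that rates option 0 higher. With `beta=0` the choice would be treated as uniformly random and the posterior would equal the prior.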

Abstract

It is often difficult to hand-specify what the correct reward function is for a task, so researchers have instead aimed to learn reward functions from human behavior or feedback. The types of behavior interpreted as evidence of the reward function have expanded greatly in recent years. We've gone from demonstrations, to comparisons, to reading into the information leaked when the human is pushing the robot away or turning it off. And surely, there is more to come. How will a robot make sense of all these diverse types of behavior? Our key insight is that different types of behavior can be interpreted in a single unifying formalism - as a reward-rational choice that the human is making, often implicitly. The formalism offers both a unifying lens with which to view past work, as well as a recipe for interpreting new sources of information that are yet to be uncovered. We provide two examples to showcase this: interpreting a new feedback type, and reading into how the choice of feedback itself leaks information about the reward.

Paper Structure

This paper contains 13 sections, 2 theorems, 30 equations, 6 figures, 2 tables.

Key Result

Theorem A.1

The solution to the satisficing maximum entropy problem is the Boltzmann distribution $\mathbb{P}_\beta(c) \propto \exp(\beta \cdot r(\psi(c)))$, where $\beta$ is the unique value satisfying the satisficing constraint (\ref{eq:sat-cond}).
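The satisficing constraint itself is not reproduced on this page, so the sketch below makes a loud assumption: that the constraint fixes the expected reward under the distribution to a target value. Under that assumed form, the expected reward is strictly increasing in $\beta$ whenever the rewards are not all equal (its derivative is the variance of the reward under the distribution), so a simple bisection recovers the unique $\beta$. The names `solve_beta` and `target` are hypothetical.

```python
import numpy as np

def boltzmann(rewards, beta):
    """Boltzmann distribution over options with the given rewards."""
    logits = beta * np.asarray(rewards, dtype=float)
    logits -= logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def expected_reward(rewards, beta):
    return float(np.dot(boltzmann(rewards, beta), rewards))

def solve_beta(rewards, target, hi=1e3, tol=1e-10):
    """Bisection for beta >= 0 such that E_beta[r] == target.

    ASSUMPTION: the constraint has this expected-reward form; the
    paper's actual constraint may differ. The search interval is
    [0, hi], so targets extremely close to max(r) may need a larger hi.
    """
    rewards = np.asarray(rewards, dtype=float)
    if not rewards.mean() <= target < rewards.max():
        raise ValueError("target must lie in [mean(r), max(r))")
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_reward(rewards, mid) < target:
            lo = mid   # distribution is still too spread out
        else:
            hi = mid   # distribution is already sharp enough
    return 0.5 * (lo + hi)

rewards = [0.0, 1.0, 2.0]
beta = solve_beta(rewards, target=1.5)
```

A target equal to the mean reward gives $\beta \approx 0$ (the uniform distribution), and targets approaching the maximum reward push $\beta$ upward.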

Figures (6)

  • Figure 1: Different behavior types described in Sec. \ref{sec:applying-formalism} in a gridworld with three features: avoiding/going on the rug, getting the rug dirty, and reaching the goal (green). For each, we display the choices, grounding, and feasible rewards under the constraint formulation of robot inference (\ref{eq:constraints}). Each trajectory is a finite-horizon path that begins at the start (red). Orange denotes $c^*$ and $\psi(c^*)$, while gray denotes the other choices $c$ in $\mathcal{C}$. For instance, the comparison affects the feasible reward space by removing the halfspace where going on the rug is good. It does not inform the robot about the goal, because both trajectories end at the goal. The demonstration removes the space where the rug is good, where the goal is bad (because the alternatives do not reach the goal), and where getting the rug dirty is good (because the alternatives slightly graze the rug). The correction is similar to the demonstration, but supports no inference about the goal, since all corrections end at the goal.
  • Figure 2: A case study for teaching a reward for robot arm motion using two training environments. The robot trades off efficiency against keeping its distance from the human and from the table. We use the constraints interpretation of feedback in this study. We start by defining a proxy reward that produces acceptable behavior (orange trajectories) in the training environments (1st row). This initial feedback significantly prunes the feasible space, but is not enough to guarantee good performance in other environments. On the right, we see trajectories still considered feasible in two test environments. The green one is correct; the other feasible trajectories are either too close to the human or too close to the table. After an improvement feedback and a comparison, the robot shrinks the space of feasible rewards, removing extraneous rewards that produce undesirable behavior at test time.
  • Figure 3: Environments used for experiments on active selection of feedback. (Top) These four environments were used during "training". (Bottom) These four environments were held as a test set to measure maximum and average regret.
  • Figure 4: Statistics computed over 10 iterations of our greedy maximum information gain algorithm. Demonstrations (purple) are initially very information dense but quickly flatten out, whereas comparisons (cyan) obtain more information but less efficiently. Combining the two methods (orange) inherits the positive aspects of both: the efficiency of demonstrations and the precision of comparisons.
  • Figure 5: (Left) Environment with designated start (red circle), goal (green circle) and lava area (red tiles). The human can provide a correction (one of the green trajectories) or turn off the robot, forcing the robot to stop at the marked dot. (Middle) Belief distribution over rewards after the human provides feedback ($\beta_0 = 10.0$). Darker indicates higher probability. The metareasoning model is able to rule out more reward functions than the naive model. (Right) When the human's metareasoning carries no signal ($\beta_0 = 0$), the metareasoning model (orange) and the naive model (gray) perform equally well. As $\beta_0$ increases, the advantage of the metareasoning model also increases.
  • ...and 1 more figure
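The constraint formulation mentioned in the Figure 1 caption can be illustrated with a small sketch. It assumes a reward that is linear in trajectory features, $r(\xi) = \theta \cdot \phi(\xi)$ (an assumption suggested by the caption's "halfspace" wording, not confirmed on this page), and a finite set of candidate weight vectors. The feature ordering, the feature values, and the function name `feasible_mask` are all hypothetical.

```python
import numpy as np

def feasible_mask(thetas, chosen_features, other_features):
    """Mark candidate weights consistent with the human's choice.

    A candidate theta is kept when it scores the chosen grounding at
    least as high as every alternative:
        theta . (phi(chosen) - phi(other)) >= 0 for all others.
    Each alternative therefore removes one halfspace of weights.
    """
    thetas = np.asarray(thetas, dtype=float)                    # shape (K, d)
    diffs = (np.asarray(chosen_features, dtype=float)
             - np.asarray(other_features, dtype=float))         # shape (M, d)
    return np.all(thetas @ diffs.T >= 0.0, axis=1)              # shape (K,)

# Hypothetical feature order: [on_rug, rug_dirty, reached_goal]
chosen = [0.0, 0.0, 1.0]        # avoids the rug, reaches the goal
others = [[1.0, 0.0, 1.0]]      # crosses the rug, reaches the goal
thetas = [[-1.0, 0.0,  1.0],    # rug bad, goal good
          [ 1.0, 0.0,  1.0],    # rug good, goal good
          [-1.0, 0.0, -1.0]]    # rug bad, goal bad
mask = feasible_mask(thetas, chosen, others)
```

In this toy comparison only the "rug good" candidate is ruled out. The "goal bad" candidate survives because both trajectories reach the goal, which mirrors the caption's remark that the comparison does not inform the robot about the goal.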

Theorems & Definitions (4)

  • Definition 2.1: Reward-rational choice
  • Definition A.1: Satisficing MaxEnt problem
  • Theorem A.1 (Jaynes, 1957)
  • Corollary A.1