Supervised Reward Inference

Will Schwarzer, Jordan Schneider, Philip S. Thomas, Scott Niekum

TL;DR

We propose that supervised learning offers a unified framework for inferring reward functions from any class of behavior, and show that this approach is asymptotically Bayes-optimal under mild assumptions.

Abstract

Existing approaches to reward inference from behavior typically assume that humans provide demonstrations according to specific models of behavior. However, humans often indicate their goals through a wide range of behaviors, from actions that are suboptimal due to poor planning or execution to behaviors which are intended to communicate goals rather than achieve them. We propose that supervised learning offers a unified framework to infer reward functions from any class of behavior, and show that such an approach is asymptotically Bayes-optimal under mild assumptions. Experiments on simulated robotic manipulation tasks show that our method can efficiently infer rewards from a wide variety of arbitrarily suboptimal demonstrations.

Paper Structure

This paper contains 30 sections, 3 theorems, 36 equations, 4 figures, 4 tables, 1 algorithm.

Key Result

Lemma A.3

Let $M_R = (\mathcal{S}, \mathcal{A}, p, R, d_0, \gamma)$ be an MDP with random reward function $R$ following any distribution $P_R \in \mathcal{P}(\mathcal{R})$. Define the expected reward function $\bar{R}: \mathcal{S} \rightarrow \mathbb{R}$ to be $\bar{R}(s) = \mathbb{E}[R(s)]$. Then, the policy is an optimal policy for the MDP $M_{\bar{R}} = (\mathcal{S}, \mathcal{A}, p, \bar{R}, d_0, \gamma)$
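
One way to read the (truncated) statement above is through linearity of expectation: for any fixed policy, the expected discounted return under a random reward $R$ equals the return under the mean reward $\bar{R}$, so a policy that maximizes expected return is also optimal for $M_{\bar{R}}$. The Python sketch below checks this numerically on a toy MDP. The transitions, the two candidate reward functions, their probabilities, and the fixed start state (standing in for $d_0$) are all hypothetical and are not taken from the paper.

```python
import itertools

GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2
# Hypothetical deterministic transitions: NEXT[s][a] is the next state.
NEXT = [[0, 1], [0, 2], [2, 1]]
START = 0  # fixed start state, a simplification of the distribution d_0
# Hypothetical distribution P_R: (state-based reward function, probability).
REWARDS = [([0.0, 1.0, 0.0], 0.7), ([0.0, 0.0, 2.0], 0.3)]


def policy_return(policy, reward, horizon=200):
    """Discounted return of a deterministic policy from START (truncated)."""
    s, total, discount = START, 0.0, 1.0
    for _ in range(horizon):
        total += discount * reward[s]
        s = NEXT[s][policy[s]]
        discount *= GAMMA
    return total


def mean_reward():
    """R_bar(s) = E[R(s)] under the hypothetical distribution above."""
    return [sum(p * r[s] for r, p in REWARDS) for s in range(N_STATES)]


def expected_return(policy):
    """Expectation over R of the policy's return in M_R."""
    return sum(p * policy_return(policy, r) for r, p in REWARDS)


# Enumerate every deterministic policy (one action per state).
policies = list(itertools.product(range(N_ACTIONS), repeat=N_STATES))
r_bar = mean_reward()
best_expected = max(policies, key=expected_return)
best_mean = max(policies, key=lambda pi: policy_return(pi, r_bar))
```

In this toy setting the two maximizers achieve the same value. Ties between policies are possible, so the comparison is best made on returns rather than on the policies themselves.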

Figures (4)

  • Figure 1: Example SRI model architecture. Behavior trajectories are processed independently into trajectory representations by a sequence model such as a transformer, and then these representations are combined into an overall task representation $\psi$ by a set model such as a Set Transformer (Lee et al., 2019) (blue path). This computation is done only once per task. Independently, the current state is processed into a representation $\phi$ by a standard multi-layer perceptron or convolutional neural network (red path). This process is done once per timestep, but is very fast: it consists of one forward pass through two small MLPs. Finally, the task and observation representations are combined by another multi-layer perceptron into a scalar reward.
  • Figure 2: Performance of SRI and baselines when given noisy demonstrations. Tasks are Meta-World reach tasks with demonstrations from the noisy gesture$_\varepsilon$ class for various values of $\varepsilon$ (see Section \ref{sec:tasks}). Performance is measured by the robot hand's average proximity to the goal under each method's learned policy, clipped per-trial to a minimum of 0 (see Section \ref{sec:metrics}), with 30-trial standard error bars. Note that the task is effectively impossible at $\varepsilon=1.0$, since the demonstrations are pure noise; this setting is included only for completeness. Results show that SRI approaches ground-truth RL performance when given perfect demonstrations, and degrades less under noisy, suboptimal demonstrations than other methods.
  • Figure 3: Performance of SRI and baselines when given demonstrations that deterministically reach toward the wrong location. Tasks are Meta-World reach tasks with demonstrations from the psychic$_\alpha$ class (see Section \ref{sec:tasks}) for various values of $\alpha$. Performance is measured by the robot hand's average proximity to the goal under each method's learned policy, clipped per-trial to a minimum of 0 (see Section \ref{sec:metrics}), with 30-trial standard error bars. Results show that the performance of optimality-assuming algorithms falls to zero as the demonstrations become more suboptimal, while SRI's learned policies remain nearly optimal regardless of demonstration optimality.
  • Figure 4: Performance of SRI and baselines when given varying numbers of noisy demonstrations. Tasks are Meta-World reach tasks, with demonstrations from the noisy$_{0.87}$ class (see Section \ref{sec:tasks}). Performance is measured by the robot hand's average proximity to the goal under each method's learned policy, clipped per-trial to a minimum of 0 (see Section \ref{sec:metrics}), with 30-trial standard error bars. Results show that SRI performs better than optimality-assuming methods regardless of demonstration quantity.
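
The data flow described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustration of the two paths and how they meet, not the authors' implementation: mean pooling is swapped in for both the sequence model and the set model, all weights are random and untrained, and every name and dimension (`encode_task`, `encode_state`, `reward`, `STATE_DIM`, `HIDDEN`) is hypothetical.

```python
import math
import random

rng = random.Random(0)
STATE_DIM, HIDDEN = 4, 8  # hypothetical sizes


def make_linear(n_in, n_out):
    """Random weights and zero biases for one dense layer (untrained)."""
    w = [[rng.gauss(0, 1 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_out)]
    return w, [0.0] * n_out


def linear(layer, x):
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def mean_pool(vectors):
    """Permutation-invariant pooling, a stand-in for the learned models."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


traj_layer = make_linear(STATE_DIM, HIDDEN)
task_layer = make_linear(HIDDEN, HIDDEN)
state_layer = make_linear(STATE_DIM, HIDDEN)
head_hidden = make_linear(2 * HIDDEN, HIDDEN)
head_out = make_linear(HIDDEN, 1)


def encode_task(trajectories):
    """Blue path: encode each trajectory, then pool the set into psi."""
    traj_reprs = [mean_pool([relu(linear(traj_layer, s)) for s in traj])
                  for traj in trajectories]
    return relu(linear(task_layer, mean_pool(traj_reprs)))


def encode_state(state):
    """Red path: map the current state to phi."""
    return relu(linear(state_layer, state))


def reward(psi, state):
    """Combine psi and phi with a small MLP into a scalar reward."""
    phi = encode_state(state)
    return linear(head_out, relu(linear(head_hidden, psi + phi)))[0]


# Three hypothetical demonstrations of five states each.
trajectories = [[[rng.random() for _ in range(STATE_DIM)] for _ in range(5)]
                for _ in range(3)]
psi = encode_task(trajectories)           # computed once per task
r = reward(psi, [0.1, 0.2, 0.3, 0.4])     # computed once per timestep
```

The split mirrors the cost structure in the caption: `encode_task` runs once per task and its output is cached, while `reward` runs at every timestep and touches only small dense layers. Unlike a real sequence model, mean pooling over timesteps discards the ordering of states within a trajectory.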

Theorems & Definitions (13)

  • Definition 4.1: Supervised Reward Inference
  • Proof: Proof sketch
  • Remark A.1
  • Remark A.2
  • Lemma A.3
  • Proof
  • Remark A.4
  • Remark A.5
  • Remark A.6
  • Lemma A.7: Pointwise Convergence of $R_{\theta_K}$
  • ...and 3 more