
Reinforcement Learning From Imperfect Corrective Actions And Proxy Rewards

Zhaohui Jiang, Xuening Feng, Paul Weng, Yifei Zhu, Yan Song, Tianze Zhou, Yujing Hu, Tangjie Lv, Changjie Fan

TL;DR

We consider a framework in which a human labeler provides additional feedback in the form of corrective actions, which express the labeler's action preferences but may themselves be imperfect. The resulting method aligns better with human preferences and is more sample-efficient than baseline methods.

Abstract

In practice, reinforcement learning (RL) agents are often trained with a possibly imperfect proxy reward function, which may lead to a human-agent alignment issue (i.e., the learned policy either converges to non-optimal performance with low cumulative rewards, or achieves high cumulative rewards but in an undesired manner). To tackle this issue, we consider a framework where a human labeler can provide additional feedback in the form of corrective actions, which express the labeler's action preferences, although this feedback may be imperfect as well. In this setting, to obtain a better-aligned policy guided by both learning signals, we propose a novel value-based deep RL algorithm called Iterative learning from Corrective actions and Proxy rewards (ICoPro), which cycles through three phases: (1) Solicit sparse corrective actions from a human labeler on the agent's demonstrated trajectories; (2) Incorporate these corrective actions into the Q-function using a margin loss to enforce adherence to the labeler's preferences; (3) Train the agent with standard RL losses regularized with a margin loss to learn from proxy rewards and propagate the Q-values learned from human feedback. Another novel design in our approach is to integrate pseudo-labels from the target Q-network to reduce human labor and further stabilize training. We experimentally validate our proposal on a variety of tasks (Atari games and autonomous driving on a highway). On the one hand, using proxy rewards with different levels of imperfection, our method aligns better with human preferences and is more sample-efficient than baseline methods. On the other hand, facing corrective actions with different types of imperfection, our method can overcome the non-optimality of this feedback thanks to the guidance from the proxy rewards.
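
The margin loss mentioned in phase (2) can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so the version below assumes the common large-margin formulation, max over actions of [Q(s, a) + m(a, a_L)] minus Q(s, a_L), where a_L is the corrective action and m is zero for a_L and a positive constant otherwise. The function name `margin_loss` and the default margin value are hypothetical and are not taken from the paper.

```python
def margin_loss(q_values, labeled_actions, margin=0.05):
    """Mean large-margin loss over a batch (assumed form, not the paper's exact loss).

    q_values: list of per-state lists, q_values[i][a] = Q(s_i, a).
    labeled_actions: list of corrective action indices a_L, one per state.
    margin: hypothetical positive constant added to every non-labeled action.
    """
    losses = []
    for q_row, a_l in zip(q_values, labeled_actions):
        # Add the margin to every action except the labeled one.
        augmented = [q + (0.0 if a == a_l else margin)
                     for a, q in enumerate(q_row)]
        # The loss is zero only if Q(s, a_L) beats every other action by >= margin.
        losses.append(max(augmented) - q_row[a_l])
    return sum(losses) / len(losses)


# First state already prefers the labeled action; second state does not.
example = margin_loss([[1.0, 0.0], [0.0, 1.0]], [0, 0], margin=0.5)
```

Under this assumed form, the loss pushes the Q-value of the labeled action above all others by at least the margin, and it vanishes once the greedy action agrees with the label with enough separation.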

Paper Structure (43 sections, 4 equations, 11 figures, 14 tables, 1 algorithm)


Figures (11)

  • Figure 1: ICoPro is an iterative method with three phases in each iteration. It starts with the data-collection phase, which gathers the agent's rollouts. Segments are then sampled from these rollouts and used as queries for the labeler, who provides several corrective actions. Two separate policy-update phases follow: the fine-tuning phase and the propagation phase. The updated policy is then used in the data-collection phase of the next iteration.
  • Figure 2: Experiments on highway over the set of proxy rewards. The subplots in each row compare performance with respect to one performance metric. $|\mathcal{D}^{L}| = 1.5$K.
  • Figure 3: ICoPro facing different types of non-optimality of corrective actions (different colors). Performance is averaged over large/small $\mathcal{D}^{L}$ in \ref{fig:Atari_diff_rand_avg} and \ref{fig:Atari_diff_ckpt_avg}.
  • Figure 4: Performances of the scripted labeler and ICoPro-Human in Pong. Each row shows a sequence of state-action pairs. While the scripted labeler prefers to catch the ball with the corner of the paddle, ICoPro-Human prefers to catch the ball with a larger part of the paddle, which is more human-like.
  • Figure 5: Comparison of baseline methods on Atari in terms of the average episode return measured with the raw reward ${r_{A}}$. The shaded region indicates the standard deviation over 5 seeds. $N_q$ in the titles refers to the number of queries per iteration; the larger (resp. smaller) values correspond to the large (resp. small) $\mathcal{D}^{L}$ in \ref{tab:AtariBaselines}.
  • ...and 6 more figures
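
The iteration described in the Figure 1 caption can be sketched as a simple control loop. This is only a structural outline under stated assumptions: the four callables (`collect_rollouts`, `query_labeler`, `finetune`, `propagate`), the uniform segment sampling, and all parameter names are hypothetical placeholders, not the authors' implementation.

```python
import random


def icopro_style_loop(collect_rollouts, query_labeler, finetune, propagate,
                      n_iterations, n_queries, segment_len, seed=0):
    """Run the three-phase cycle from the Figure 1 caption (structural sketch).

    collect_rollouts() -> list of trajectories (each a list of states).
    query_labeler(segment) -> list of (state, corrective_action) pairs.
    finetune(labeled) and propagate(rollouts, labeled) update the policy.
    All four callables are hypothetical and supplied by the caller.
    """
    rng = random.Random(seed)
    labeled = []  # plays the role of the labeled set D^L
    for _ in range(n_iterations):
        # Phase 1: collect rollouts and query the labeler on sampled segments.
        rollouts = collect_rollouts()
        for _ in range(n_queries):
            traj = rng.choice(rollouts)
            start = rng.randrange(max(1, len(traj) - segment_len + 1))
            segment = traj[start:start + segment_len]
            labeled.extend(query_labeler(segment))
        # Phase 2: fine-tune on the corrective actions.
        finetune(labeled)
        # Phase 3: propagate using the rollouts and the labels.
        propagate(rollouts, labeled)
    return labeled


# Usage with trivial stand-ins that only record the call order.
calls = []
labels = icopro_style_loop(
    collect_rollouts=lambda: [list(range(10))],
    query_labeler=lambda seg: [(seg[0], 1)],
    finetune=lambda labeled: calls.append("finetune"),
    propagate=lambda rollouts, labeled: calls.append("propagate"),
    n_iterations=2, n_queries=3, segment_len=4,
)
```

Passing the phases in as callables keeps the sketch independent of any particular Q-network or environment; how segments are actually selected and how many corrective actions the labeler gives per segment are details the caption does not specify.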