Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations

Daniel S. Brown, Wonjoon Goo, Prabhat Nagarajan, Scott Niekum

TL;DR

This work tackles the limitation that inverse reinforcement learning often cannot surpass a suboptimal demonstrator by leveraging ranked demonstrations to extrapolate the underlying intent. The authors introduce T-REX, which learns a neural reward function from trajectory rankings using a partial-trajectory, Bradley-Terry-based loss, and then optimizes policies with PPO on this learned reward. Across MuJoCo and Atari benchmarks, T-REX consistently outperforms the best demonstrations and strong baselines, demonstrating robust extrapolation, resilience to ranking noise, and even success with time-based and human rankings. The results indicate that ranking-driven reward extrapolation can enable high-dimensional agents to achieve significantly better-than-demonstrator performance without ground-truth rewards or expert supervision during policy learning.
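The pairwise, Bradley-Terry-style ranking loss mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the neural reward network is replaced by a hypothetical one-parameter linear reward, observations are plain floats, and the names `predicted_return`, `pairwise_ranking_loss`, and `make_linear_reward` are invented for illustration. The only idea taken from the summary is that the predicted return of a partial trajectory is the sum of its per-observation rewards, and that the loss is the negative log-probability of the higher-ranked snippet under a softmax over the two predicted returns.

```python
import math


def predicted_return(reward_fn, snippet):
    """Sum the predicted per-observation rewards over a partial trajectory."""
    return sum(reward_fn(obs) for obs in snippet)


def pairwise_ranking_loss(reward_fn, worse_snippet, better_snippet):
    """Negative log-probability that the better snippet wins (Bradley-Terry)."""
    r_worse = predicted_return(reward_fn, worse_snippet)
    r_better = predicted_return(reward_fn, better_snippet)
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(r_worse, r_better)
    log_norm = m + math.log(math.exp(r_worse - m) + math.exp(r_better - m))
    return -(r_better - log_norm)


def make_linear_reward(weight):
    """Hypothetical stand-in for a reward network: reward = weight * obs."""
    return lambda obs: weight * obs


# Toy data (assumption): scalar observations, with `better` ranked above `worse`.
worse = [0.1, 0.2, 0.1]
better = [0.5, 0.6, 0.7]

loss_good = pairwise_ranking_loss(make_linear_reward(1.0), worse, better)
loss_bad = pairwise_ranking_loss(make_linear_reward(-1.0), worse, better)
print(loss_good < loss_bad)  # a reward that agrees with the ranking has lower loss
```

In a full pipeline this loss would be minimized over the reward parameters by gradient descent, and the resulting reward would then be handed to a policy optimizer such as PPO.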

Abstract

A critical flaw of existing inverse reinforcement learning (IRL) methods is their inability to significantly outperform the demonstrator. This is because IRL typically seeks a reward function that makes the demonstrator appear near-optimal, rather than inferring the underlying intentions of the demonstrator that may have been poorly executed in practice. In this paper, we introduce a novel reward-learning-from-observation algorithm, Trajectory-ranked Reward EXtrapolation (T-REX), that extrapolates beyond a set of (approximately) ranked demonstrations in order to infer high-quality reward functions from a set of potentially poor demonstrations. When combined with deep reinforcement learning, T-REX outperforms state-of-the-art imitation learning and IRL methods on multiple Atari and MuJoCo benchmark tasks and achieves performance that is often more than twice the performance of the best demonstration. We also demonstrate that T-REX is robust to ranking noise and can accurately extrapolate intention by simply watching a learner noisily improve at a task over time.
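The idea of ranking demonstrations by watching a learner improve over time can be sketched as follows. This is an illustrative assumption rather than the paper's procedure: trajectories are assumed to be stored in the order they were recorded, a later trajectory is simply labeled as preferred over an earlier one, and snippets are drawn uniformly at random. The function name `sample_time_ranked_pairs` and its parameters are hypothetical.

```python
import random


def sample_time_ranked_pairs(trajectories, num_pairs, snippet_len, seed=0):
    """Sample (worse, better) snippet pairs, using recording order as the ranking.

    Assumes `trajectories` is ordered from earliest to latest in training and
    that every trajectory has at least `snippet_len` observations.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        # Pick two distinct trajectories; the earlier index is labeled worse.
        i, j = sorted(rng.sample(range(len(trajectories)), 2))
        earlier, later = trajectories[i], trajectories[j]
        # Draw a random fixed-length window (partial trajectory) from each.
        si = rng.randint(0, len(earlier) - snippet_len)
        sj = rng.randint(0, len(later) - snippet_len)
        pairs.append((earlier[si:si + snippet_len], later[sj:sj + snippet_len]))
    return pairs


# Toy data (assumption): trajectory k is a list of ten observations equal to k.
toy_trajectories = [[k] * 10 for k in range(5)]
toy_pairs = sample_time_ranked_pairs(toy_trajectories, num_pairs=20, snippet_len=4)
print(len(toy_pairs))
```

Because a learner improves only noisily, some of these time-based labels will be wrong; the abstract's robustness claim is that reward learning tolerates this kind of ranking noise.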

Paper Structure

This paper contains 35 sections, 3 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: T-REX takes a sequence of ranked demonstrations and learns a reward function from these rankings that allows policy improvement over the demonstrator via reinforcement learning.
  • Figure 1: HalfCheetah policy visualization. For each subplot, (top) is the best given demonstration policy in a stage, and (bottom) is the trained policy with a T-REX reward function.
  • Figure 2: Imitation learning performance for three robotic locomotion tasks when given suboptimal demonstrations. Performance is the total distance traveled, measured by the final x-position of the robot's body. For each stage and task, the best performance given suboptimal demonstrations is shown for T-REX (ours), BCO (Torabi et al., 2018), and GAIL (Ho and Ermon, 2016). The dashed line shows the performance of the best demonstration.
  • Figure 2: Maximum and minimum predicted observations and corresponding attention maps for Beam Rider. The observation with the maximum predicted reward shows successfully destroying an enemy ship, with the network paying attention to the oncoming enemy ships and the shot that was fired to destroy the enemy ship. The observation with minimum predicted reward shows an enemy shot that destroys the player's ship and causes the player to lose a life. The network attends most strongly to the enemy ships but also to the incoming shot.
  • Figure 3: Extrapolation plots for T-REX on MuJoCo Stage 1 demonstrations. Red points correspond to demonstrations and blue points correspond to trajectories not given as demonstrations. The solid line represents the performance range of the demonstrator, and the dashed line represents extrapolation beyond the demonstrator's performance. The x-axis is the ground-truth return and the y-axis is the predicted return from our learned reward function. Predicted returns are normalized to have the same scale as the ground-truth returns.
  • ...and 9 more figures