Truncated Horizon Policy Search: Combining Reinforcement Learning & Imitation Learning

Wen Sun, J. Andrew Bagnell, Byron Boots

TL;DR

The paper tackles sample-inefficient reinforcement learning by fusing imitation learning with RL through cost shaping using a cost-to-go oracle. It introduces THOR, a gradient-based method that optimizes policies over a truncated planning horizon on a reshaped MDP, with horizon length controlled by oracle accuracy. The authors provide theoretical bounds showing that with a perfect oracle the horizon collapses to one (AggreVaTe-like behavior), while with imperfect information a k-step horizon (k>1) can guarantee outperforming the oracle, along with a concrete upper bound. Empirically, THOR achieves superior sample efficiency and competitive or better performance than strong RL and IL baselines across discrete and continuous control tasks, validating the practical benefit of truncated horizon search.

Abstract

In this paper, we propose to combine imitation and reinforcement learning via the idea of reward shaping using an oracle. We study the effect of a near-optimal cost-to-go oracle on the planning horizon and demonstrate that the cost-to-go oracle shortens the learner's planning horizon as a function of its accuracy: a globally optimal oracle can shorten the planning horizon to one, leading to a one-step greedy Markov Decision Process which is much easier to optimize, while an oracle that is far from optimal requires planning over a longer horizon to achieve near-optimal performance. Hence our new insight bridges the gap and interpolates between imitation learning and reinforcement learning. Motivated by the above-mentioned insights, we propose Truncated HORizon Policy Search (THOR), a method that focuses on searching for policies that maximize the total reshaped reward over a finite planning horizon when the oracle is sub-optimal. We experimentally demonstrate that a gradient-based implementation of THOR can achieve superior performance compared to RL baselines and IL baselines even when the oracle is sub-optimal.
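The shaping idea in the abstract can be illustrated with a small sketch. It assumes the standard potential-based form of shaping, $c'(s_t,a_t) = c(s_t,a_t) + \gamma \hat{V}^e(s_{t+1}) - \hat{V}^e(s_t)$, which the abstract itself does not spell out; the function names `reshaped_costs` and `truncated_return` are hypothetical and are not taken from the authors' implementation.

```python
from typing import Callable, List, Sequence


def reshaped_costs(
    costs: Sequence[float],
    states: Sequence[int],
    v_oracle: Callable[[int], float],
    gamma: float,
) -> List[float]:
    """Apply potential-based shaping along one trajectory (assumed form).

    costs[t] is the cost paid at step t; states has one more entry than
    costs, since states[t + 1] is the state reached after step t.
    """
    return [
        c + gamma * v_oracle(states[t + 1]) - v_oracle(states[t])
        for t, c in enumerate(costs)
    ]


def truncated_return(shaped: Sequence[float], t: int, k: int, gamma: float) -> float:
    """Discounted sum of reshaped costs over at most k steps starting at t."""
    end = min(t + k, len(shaped))
    return sum(gamma ** (i - t) * shaped[i] for i in range(t, end))


# Toy trajectory: three steps of unit cost, then termination in state 3.
states = [0, 1, 2, 3]
costs = [1.0, 1.0, 1.0]

# With an exact undiscounted cost-to-go oracle, every reshaped cost is zero.
exact = reshaped_costs(costs, states, lambda s: 3.0 - s, gamma=1.0)

# With an oracle that overestimates by 2x, the reshaped costs are nonzero,
# and a k-step sum carries real cost information to correct the oracle.
rough = reshaped_costs(costs, states, lambda s: 2.0 * (3.0 - s), gamma=0.9)
two_step = truncated_return(rough, t=0, k=2, gamma=0.9)
```

Under this assumed form the k-step sum telescopes to the k-step discounted cost plus $\gamma^k \hat{V}^e(s_{t+k}) - \hat{V}^e(s_t)$, so a longer horizon leans less on the oracle and more on observed costs.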

Paper Structure

This paper contains 16 sections, 2 theorems, 19 equations, 3 figures, 1 algorithm.

Key Result

Theorem 3.1

There exists an MDP and an imperfect oracle $\hat{V}^e(s)$ with $|\hat{V}^e(s) - V^*_{\mathcal{M}_0,h}(s)| = \epsilon$, such that the performance of the induced policy from the cost-to-go oracle $\hat{\pi}^*=\arg\min_a \left[c(s,a) + \gamma\mathbb{E}_{s'\sim P_{sa}}[\hat{V}^e(s')]\right]$ is at leas
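The induced policy $\hat{\pi}^*$ in the statement above is a one-step greedy choice against the oracle. A minimal tabular sketch of that argmin follows; the array layout (`cost` of shape `(S, A)`, `P` of shape `(S, A, S)`) and the toy MDP are assumptions made for illustration and are not the MDP constructed in the paper.

```python
import numpy as np


def greedy_policy_from_oracle(
    cost: np.ndarray, P: np.ndarray, v_hat: np.ndarray, gamma: float
) -> np.ndarray:
    """Return argmin_a [c(s,a) + gamma * E_{s'~P_sa}[v_hat(s')]] for every state.

    cost:  (S, A) immediate costs
    P:     (S, A, S) transition probabilities, P[s, a, s']
    v_hat: (S,) oracle estimate of the cost-to-go
    """
    expected_next = P @ v_hat              # shape (S, A)
    q_hat = cost + gamma * expected_next   # one-step lookahead cost
    return np.argmin(q_hat, axis=1)


# Toy 2-state MDP (illustrative only). State 1 is absorbing with zero cost.
# In state 0, action 0 stays put at cost 1; action 1 moves to state 1 at cost 2.
cost = np.array([[1.0, 2.0], [0.0, 0.0]])
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, :, 1] = 1.0

# An oracle that rates state 0 as expensive sends the learner to state 1.
good = greedy_policy_from_oracle(cost, P, np.array([10.0, 0.0]), gamma=0.9)

# An oracle that wrongly rates state 1 as expensive keeps the learner in state 0.
bad = greedy_policy_from_oracle(cost, P, np.array([0.0, 5.0]), gamma=0.9)
```

The second call shows how an inaccurate $\hat{V}^e$ can flip the greedy action, which is the kind of failure that motivates looking more than one step ahead.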

Figures (3)

  • Figure 1: Reward versus batch iterations of THOR with different $k$ and TRPO-GAE (blue) for Mountain car, Sparse Reward (SR) CartPole, and Acrobot with different horizon. Average rewards across 25 runs are shown in solid lines and averages + std are shown in dotted lines.
  • Figure 2: Reward versus batch iterations of THOR with different $k$ and TRPO-GAE (blue) for Sparse Reward (SR) Inverted Pendulum, Sparse Reward Inverted-Double Pendulum, Swimmer and Hopper. Average rewards across 25 runs are shown in solid lines and averages + std are shown in dotted lines.
  • Figure 3: The special MDP we constructed for Theorem 3.1 (the lower bound)

Theorems & Definitions (4)

  • Theorem 3.1
  • Theorem 3.2
  • proof
  • proof: Proof of Theorem 3.2 (the upper bound)