Sample Efficient Reinforcement Learning by Automatically Learning to Compose Subtasks

Shuai Han; Mehdi Dastani; Shihan Wang

TL;DR

This work proposes an RL algorithm that automatically structures the reward function to improve sample efficiency, given only a set of labels that signify subtasks, and shows that this approach significantly outperforms state-of-the-art baselines as task difficulty increases.

Abstract

Improving sample efficiency is central to Reinforcement Learning (RL), especially in environments where rewards are sparse. Some recent approaches specify reward functions as manually designed or learned reward structures, whose integration into RL algorithms is claimed to significantly improve learning efficiency. Manually designed reward structures can be inaccurate, and existing methods for learning them automatically are often computationally intractable for complex tasks. RL algorithms that integrate inaccurate or partial reward structures fail to learn optimal policies. In this work, we propose an RL algorithm that can automatically structure the reward function for sample efficiency, given a set of labels that signify subtasks. Given such minimal knowledge about the task, we train a high-level policy that selects the optimal subtask in each state, together with a low-level policy that efficiently learns to complete each subtask. We evaluate our algorithm in a variety of sparse-reward environments. The experimental results show that our approach significantly outperforms state-of-the-art baselines as the difficulty of the task increases.

Paper Structure (15 sections, 7 equations, 6 figures, 2 algorithms)

This paper contains 15 sections, 7 equations, 6 figures, 2 algorithms.

Introduction
Related Work
Problem Setting
Methodology
Two-level policy formalization
Low-level training
High-level training
Overall algorithm
Interpretability
Experiments
Experiment settings
Comparison
Ablation study
Interpretability
Conclusion and future work

Figures (6)

Figure 1: (a) Coffee&mail task in the OfficeWorld domain, where 'a', 'c', 'm' and 'o' indicate the positions of the agent, coffee, mail and office, respectively. In this task, the agent is rewarded only when it arrives at the office after collecting both the coffee and the mail. (b) Reward machine introduced by QRM-j to expose the reward structure of this task to the RL agent.
Figure 2: Following the environment and task of Figure 1, an example of how we generate the high-level experiences in an episode for updating $Q_h$. As shown in the table, each row contains the experience data for the corresponding time step. An experience is denoted as $((s_t, p^*_t), p_t, (s_{t+1}, p^*_{t+1}), r_t)$, where $p_t$ is the assumed selected subtask and $r_t$ is the reward from the environment. In this episode, subtasks $c$, $m$, and $o$ are achieved at time steps $9$, $15$, and $27$, respectively. The assumed subtask $p_t$ selected by $\pi_h$ is therefore $c$ for time steps $0\sim8$, $m$ for time steps $9\sim14$, and $o$ for time steps $15\sim27$.
Figure 3: An example tree that records sequences of subtasks in the environment of Figure 1. This tree is generated during the training process. Except for the root, nodes in this tree refer to the subtasks specified in this environment. The edges represent the transition from one subtask to another (once the parent subtask has been achieved). The reward values on the edges are sampled from the environment based on whether the main task is achieved. Following the task description in Figure 1, '$r=1$' means the task is finished, i.e., the agent arrives at the office after collecting both the coffee and the mail. For an infinite MDP, where the number of episode steps is unbounded, the tree could be infinite. For a finite MDP, where the number of episode steps is limited, the tree is finite because its depth is bounded by the number of episode steps and its width is bounded by $|\mathcal{P}|$.
Figure 4: Learning curves of various RL algorithms on 8 environments from the OfficeWorld and MineCraft domains.
Figure 5: Learning curves for the ablation experiment of ALCS on 4 environments from OfficeWorld.
...and 1 more figure
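
The relabeling described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes an episode is given as a list of transitions `(s_t, s_next, r_t, achieved)`, where `achieved` is the subtask label reached on arriving at `s_next` (or `None`); it also assumes that $p^*_t$ can be represented as the tuple of subtasks achieved before step $t$, and that transitions after the last achieved subtask are dropped. The function name `relabel_episode` and all type aliases are hypothetical.

```python
from typing import Hashable, List, Optional, Tuple

State = Hashable
Subtask = str
Achieved = Tuple[Subtask, ...]  # assumed representation of p*_t
Transition = Tuple[State, State, float, Optional[Subtask]]
Experience = Tuple[Tuple[State, Achieved], Subtask, Tuple[State, Achieved], float]


def relabel_episode(episode: List[Transition]) -> List[Experience]:
    """Build experiences ((s_t, p*_t), p_t, (s_{t+1}, p*_{t+1}), r_t)."""
    # Backward pass: the assumed subtask p_t is the next subtask
    # achieved at or after transition t.
    assumed: List[Optional[Subtask]] = [None] * len(episode)
    upcoming: Optional[Subtask] = None
    for t in range(len(episode) - 1, -1, -1):
        if episode[t][3] is not None:
            upcoming = episode[t][3]
        assumed[t] = upcoming

    # Forward pass: track the subtasks achieved so far and emit experiences.
    experiences: List[Experience] = []
    done: Achieved = ()
    for (s, s_next, r, achieved), p in zip(episode, assumed):
        done_next = done + (achieved,) if achieved is not None else done
        if p is not None:  # assumption: skip steps after the last achieved subtask
            experiences.append(((s, done), p, (s_next, done_next), r))
        done = done_next
    return experiences


if __name__ == "__main__":
    # Toy episode: coffee, then mail, then office (reward only at the end).
    toy_episode: List[Transition] = [
        ("s0", "s1", 0.0, None),
        ("s1", "s2", 0.0, "c"),
        ("s2", "s3", 0.0, None),
        ("s3", "s4", 0.0, "m"),
        ("s4", "s5", 1.0, "o"),
    ]
    for experience in relabel_episode(toy_episode):
        print(experience)
```

Because the sketch labels transitions rather than states, an episode whose last subtask is achieved on arriving at $s_{27}$ has transitions $0$ to $26$. The backward pass is a design choice that lets every step be labeled in a single sweep once the episode has ended.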
