Subwords as Skills: Tokenization for Sparse-Reward Reinforcement Learning

David Yunis; Justin Jung; Falcon Dai; Matthew Walter

TL;DR

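This work proposes a two-component approach to skill generation: the action space is first discretized through clustering, and a tokenization technique from natural language processing then forms temporally extended actions. The method outperforms skill-generation baselines in several challenging sparse-reward domains and requires orders of magnitude less computation for skill generation and online rollouts.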

Abstract

Exploration in sparse-reward reinforcement learning is difficult because long, coordinated sequences of actions are required to achieve any reward. Moreover, continuous action spaces contain infinitely many possible actions, which only increases the difficulty of exploration. One class of methods designed to address these issues forms temporally extended actions, often called skills, from interaction data collected in the same domain, and optimizes a policy on top of this new action space. Such methods typically require a lengthy pretraining phase to form the skills before reinforcement learning can begin, especially in continuous action spaces. Given prior evidence that the full range of the continuous action space is not required in such tasks, we propose a novel approach to skill generation with two components. First, we discretize the action space through clustering. Second, we leverage a tokenization technique borrowed from natural language processing to generate temporally extended actions. This method outperforms skill-generation baselines in several challenging sparse-reward domains, and requires orders of magnitude less computation for skill generation and online rollouts. Our code is available at https://github.com/dyunis/subwords_as_skills.
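The two components described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: k-means is assumed here as the clustering step, the function names `kmeans`, `discretize`, and `bpe_skills` are hypothetical, and the subword merging and pruning step listed in the paper outline is omitted.

```python
import random
from collections import Counter


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice of clustering); returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def discretize(actions, centroids):
    """Map each continuous action to the id of its nearest centroid."""
    return [nearest(a, centroids) for a in actions]


def bpe_skills(sequences, num_merges):
    """Byte-pair encoding over sequences of discrete action ids.

    Each returned skill is a tuple of primitive ids, produced by repeatedly
    merging the most frequent adjacent pair of existing tokens.
    """
    seqs = [[(t,) for t in seq] for seq in sequences]
    skills = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), _count = pairs.most_common(1)[0]
        merged = left + right
        skills.append(merged)
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return skills
```

In this sketch, a downstream policy would choose among the returned tuples, and each tuple would be executed open-loop by replaying the centroids of its primitive ids in order. The only thing shared between the demonstrations and the new task is the action space.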

Paper Structure (31 sections, 1 equation, 16 figures, 1 table, 1 algorithm)

Introduction
Related Work
Method
Byte-Pair Encoding
Discretizing the Action Space
Merging and Pruning the Subwords
Experiments
Reinforcement Learning with Unconditional Skills
Exploration Behavior on AntMaze Medium
Comparison to Observation-Conditioned Skills
Transferring Skills
Ablations
Number of Discrete Primitives
Subword Length
Vocabulary Size
...and 16 more sections

Figures (16)

Figure 1: A sample of some "skills" that our method identifies for the AntMaze and Kitchen environments, where color is darker for poses earlier in the trajectory. Skills consist of linear motion and turning in AntMaze, and reaching and pulling motions in Kitchen. Our method discovers a finite inventory of skills, so it is possible to visualize and interpret them.
Figure 2: Abstract representation of our method. Given demonstrations in the same action space as our downstream task, we discretize the actions and apply a tokenization technique to recover "subwords" that form a vocabulary of skills. We then train a policy on top of these skills for a new task. We only require a common action space between demonstrations and the downstream task.
Figure 3: Main comparison (unnormalized scores). SSP corresponds to results from the official code of pertsch21, and SSP-p corresponds to the published results. AntMaze is scored $0$--$1$, Kitchen is scored $0$--$4$ in increments of $1$, and CoinRun is scored $0$--$100$ in increments of $10$. CoinRun is a discrete-action domain, so only SAC-discrete can be used there in place of SAC. Our method performs strongly relative to the baselines across tasks.
Figure 4: A visualization of state visitation in online RL on AntMaze Medium in the first $1$ million timesteps for SAC-discrete, SFP, SSP, and our method, averaged over $5$ seeds. The gray circle in the bottom-left denotes the start position, while the green circle in the top-right indicates the goal. Notice that our method explores the maze much more extensively, with exploration behavior that is similar for all five seeds. SAC-discrete's visitation is tightly concentrated on the start state, which is why there is so little red in its visitation rendering (i.e., it is occluded by the gray circle).
Figure 5: Comparison to methods with observation-conditioned skills. In general we see conditioning helps when the data closely overlaps with the downstream task (Kitchen), but not in AntMaze where the demonstrations are somewhat disjoint. OPAL is a closed-source method similar to SPiRL, so results are taken from ajay2020opal.
...and 11 more figures
