Skill-Critic: Refining Learned Skills for Hierarchical Reinforcement Learning

Ce Hao; Catherine Weaver; Chen Tang; Kenta Kawamoto; Masayoshi Tomizuka; Wei Zhan

TL;DR

Skill-Critic jointly optimizes the high-level and low-level policies of a hierarchical RL agent. Experiments show that its low-level policy fine-tuning and demonstration-guided regularization are both essential for good performance.

Abstract

Hierarchical reinforcement learning (RL) can accelerate long-horizon decision-making by temporally abstracting a policy into multiple levels. Promising results in sparse reward environments have been seen with skills, i.e. sequences of primitive actions. Typically, a skill latent space and policy are discovered from offline data. However, the resulting low-level policy can be unreliable due to low-coverage demonstrations or distribution shifts. As a solution, we propose the Skill-Critic algorithm to fine-tune the low-level policy in conjunction with high-level skill selection. Our Skill-Critic algorithm optimizes both the low-level and high-level policies; these policies are initialized and regularized by the latent space learned from offline demonstrations to guide the parallel policy optimization. We validate Skill-Critic in multiple sparse-reward RL environments, including a new sparse-reward autonomous racing task in Gran Turismo Sport. The experiments show that Skill-Critic's low-level policy fine-tuning and demonstration-guided regularization are essential for good performance. Code and videos are available at our website: https://sites.google.com/view/skill-critic.
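The temporal abstraction described in the abstract can be illustrated with a minimal sketch. It assumes a fixed skill horizon `H`, at which a high-level policy picks a latent skill that a low-level policy then decodes into primitive actions. Every name here (`high_level_policy`, `low_level_policy`, `env_step`, `rollout`), the toy 1-D environment, and the value of `H` are hypothetical stand-ins for illustration, not the authors' implementation.

```python
import random

H = 10  # assumed skill horizon (primitive steps per skill); hypothetical value


def high_level_policy(state):
    """Hypothetical stand-in: sample a 2-D latent skill z for the current state."""
    return (random.gauss(0.0, 1.0), random.gauss(0.0, 1.0))


def low_level_policy(state, skill):
    """Hypothetical stand-in: map (state, skill) to one primitive action."""
    return 0.1 * skill[0] - 0.05 * state


def env_step(state, action):
    """Toy 1-D environment with a sparse reward near a goal at 1.0."""
    next_state = state + action
    reward = 1.0 if abs(next_state - 1.0) < 0.05 else 0.0
    return next_state, reward


def rollout(num_steps, seed=0):
    """Run one episode; the skill is re-selected only every H steps."""
    random.seed(seed)
    state, total_reward, skills_chosen = 0.0, 0.0, 0
    skill = None
    for t in range(num_steps):
        if t % H == 0:  # high-level decision point
            skill = high_level_policy(state)
            skills_chosen += 1
        action = low_level_policy(state, skill)  # low-level decision every step
        state, reward = env_step(state, action)
        total_reward += reward
    return skills_chosen, total_reward
```

In this sketch, fine-tuning the low-level policy would correspond to updating `low_level_policy` during online training rather than keeping it frozen after pre-training on offline data.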

Paper Structure (19 sections, 13 equations, 8 figures, 1 table, 2 algorithms)

Introduction
Related Works
Skill-transfer RL
Hierarchical RL
Approach
Offline Skill Prior and Embedding Pre-Training (Stage 1)
Hierarchical Skill-Prior and Action-Prior Regularized RL Fine-tuning (Stage 2)
Semi-MDP endowed with skills
Formulation as two augmented MDPs
High-MDP Policy Optimization
Low-MDP Policy Optimization
Experiments
Maze Navigation and Trajectory Planning
Autonomous Racing
Robot Manipulation
...and 4 more sections

Figures (8)

Figure 1: Skill-Critic leverages low-coverage demonstrations to facilitate hierarchical reinforcement learning by (1) acquiring a basic skill-set from demonstrations that (2) guides online skill selection and skill improvement.
Figure 2: Hierarchical RL from a demonstration-guided latent space. Left: Offline data informs the skill embedding model with skill encoder (yellow), skill prior (green), and skill decoder (blue). Hyperparameter $\sigma_{\hat{a}}$ is augmented to the decoder to define the action prior. Right: HL (red) and LL (purple) policies are fine-tuned on downstream tasks via our Skill-Critic algorithm. During fine-tuning, the HL and LL policies are regularized with the skill and action priors.
Figure 3: Demonstrations and experiments. (a) Maze Tasks: Stage 1 demonstrations use the planner in SPiRL [pertsch2021accelerating]. Stage 2 tasks test the agent's navigation in a Diagonal Maze and path planning in a Curvy Tunnel. (b) GTS Racing on a single corner. The agent receives a reward of +1 after passing the goal state. Demonstrations start at random low-speed starting points on the course. (c) Robotic Manipulation: Stage 1 demonstrations use a hand-crafted controller [rana2023residual] to push a block across a table. Stage 2 RL tasks are Slippery Push, which uses a more slippery surface, and Cleanup Table, which includes a tray as an obstacle.
Figure 4: Maze results. Left: Rewards. Skill-Critic starts training at $N_{\textrm{HL-warm-up}} = 1$M steps. Right: Trajectories after policies converge. SPiRL reuses right-angle skills, but Skill-Critic plans diagonal and curved paths.
Figure 5: GTS Racing Results. Left: mean (std) episode reward. Right: mean (std) of cumulative time in contact with track boundary per episode. SPiRL does not improve, so Skill-Critic does not use warm-up: $N_{\textrm{HL-warm-up}}=0$.
...and 3 more figures
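
The Figure 2 caption states that the HL and LL policies are regularized with the skill and action priors, and that the hyperparameter $\sigma_{\hat{a}}$ is added to the decoder to define the action prior. One common way to implement such regularization is a KL penalty between the policy and its prior. The sketch below assumes this form, along with diagonal Gaussian distributions and an action prior centered on the decoder output with fixed standard deviation `sigma_a_hat`. These are assumptions made for illustration and are not confirmed by this excerpt. The function names and the weight `alpha` are hypothetical.

```python
import math


def kl_diag_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) between diagonal Gaussians given as per-dimension lists."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return total


def action_prior(decoder_mean, sigma_a_hat):
    """Assumed action prior: Gaussian centered on the skill decoder's output
    with a fixed standard deviation sigma_a_hat in every dimension."""
    return list(decoder_mean), [sigma_a_hat] * len(decoder_mean)


def regularized_value(q_value, kl, alpha):
    """Assumed objective shape: value estimate minus a weighted KL penalty."""
    return q_value - alpha * kl


# Usage: a low-level policy close to the prior pays only a small penalty.
prior_mu, prior_sigma = action_prior([0.0, 0.0], sigma_a_hat=1.0)
kl = kl_diag_gaussians([1.0, 0.0], [1.0, 1.0], prior_mu, prior_sigma)
objective = regularized_value(q_value=2.0, kl=kl, alpha=0.1)
```

Under these assumptions, a larger `sigma_a_hat` gives a wider prior and a weaker pull toward the decoder's actions, while a smaller value keeps the low-level policy close to the pre-trained skills.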
