EXTRACT: Efficient Policy Learning by Extracting Transferable Robot Skills from Offline Data

Jesse Zhang, Minho Heo, Zuxin Liu, Erdem Biyik, Joseph J Lim, Yao Liu, Rasool Fakoor

TL;DR

Through experiments in sparse-reward, image-based robot manipulation environments, this work demonstrates that EXTRACT learns new tasks more quickly than prior methods, with major gains in sample efficiency and performance over prior skill-based RL.

Abstract

Most reinforcement learning (RL) methods focus on learning optimal policies over low-level action spaces. While these methods can perform well in their training environments, they lack the flexibility to transfer to new tasks. Instead, RL agents that can act over useful, temporally extended skills rather than low-level actions can learn new tasks more easily. Prior work in skill-based RL either requires expert supervision to define useful skills, which is hard to scale, or learns a skill-space from offline data with heuristics that limit the adaptability of the skills, making them difficult to transfer during downstream RL. Our approach, EXTRACT, instead utilizes pre-trained vision language models to extract a discrete set of semantically meaningful skills from offline data, each of which is parameterized by continuous arguments, without human supervision. This skill parameterization allows robots to learn new tasks by only needing to learn when to select a specific skill and how to modify its arguments for the specific task. We demonstrate through experiments in sparse-reward, image-based, robot manipulation environments that EXTRACT can more quickly learn new tasks than prior works, with major gains in sample efficiency and performance over prior skill-based RL. Website at https://www.jessezhang.net/projects/extract/.
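
To make the skill parameterization concrete, here is a minimal sketch of the resulting action space. This is not the authors' code; all module names, architectures, and dimensions are illustrative assumptions. A high-level policy outputs a discrete skill ID d and a continuous argument z, and a pretrained skill decoder maps the pair to low-level actions.

```python
import torch
import torch.nn as nn

NUM_SKILLS, ARG_DIM, STATE_DIM, ACTION_DIM = 8, 10, 64, 7  # assumed sizes

class SkillPolicy(nn.Module):
    """High-level policy over (discrete skill ID d, continuous argument z)."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU())
        self.skill_head = nn.Linear(128, NUM_SKILLS)  # logits over skill IDs
        self.arg_head = nn.Linear(128, ARG_DIM)       # argument z (deterministic here, for brevity)

    def forward(self, state):
        h = self.trunk(state)
        d = torch.distributions.Categorical(logits=self.skill_head(h)).sample()
        z = self.arg_head(h)
        return d, z

# Stand-in for the pretrained skill decoder p_a(action sequence | z, d); the
# real decoder emits a variable-length action sequence, this stub emits one step.
decoder = nn.Linear(ARG_DIM + NUM_SKILLS, ACTION_DIM)

policy = SkillPolicy()
d, z = policy(torch.randn(1, STATE_DIM))
d_onehot = nn.functional.one_hot(d, NUM_SKILLS).float()
action = decoder(torch.cat([z, d_onehot], dim=-1))
print(f"skill={d.item()}, action shape={tuple(action.shape)}")
```

In the downstream RL phase, a policy like this one is trained over skill IDs and arguments while the decoder translates its outputs into low-level actions, which is what simplifies the action space.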

Paper Structure

This paper contains 43 sections, 5 equations, 19 figures, 1 table, 4 algorithms.

Figures (19)

  • Figure 1: Without human supervision, EXTRACT extracts a discrete set of skills from offline data that can be used to learn new tasks efficiently. (1) EXTRACT first uses VLMs to extract a discrete set of aligned skills from image-action data. (2) EXTRACT then trains a skill decoder to output low-level actions given discrete skill IDs and learned continuous arguments. (3) This decoder helps a skill-based policy efficiently learn new tasks with a simplified action space over skill IDs and arguments.
  • Figure 2: EXTRACT consists of three phases. (1) Skill Extraction: We extract a discrete set of skills from offline data by clustering together visual VLM difference embeddings representing high-level behaviors. (2) Skill Learning: We train a skill decoder model, $p_{a}(\bar{a}\mid z, d)$, to output variable-length action sequences conditioned on a skill ID $d$ and a learned continuous argument $z$. The argument $z$ is learned by training $p_{a}(\bar{a}\mid z, d)$ with a VAE reconstruction objective from action sequences encoded by a skill encoder, $q(z\mid\bar{a}, d)$. We additionally train a skill selection prior and skill argument prior $p_d(d \mid s)$, $p_z(z\mid s, d)$ to predict which skills $d$ and their arguments $z$ are useful for a given state $s$. Colored arrows indicate gradients from reconstruction, argument prior, selection prior, and VAE losses. (3) Online RL: To learn a new task, we train a skill selection and skill argument policy with RL while regularizing them with the skill selection and skill argument priors. (A hedged sketch of the phase-(2) training objective follows the figure list.)
  • Figure 3: Skill label assignment consists of (1) using the VLM embedding differences for clustering, then (2) applying a median filter over the labels to smooth out noisy assignments. (A code sketch of this step, together with the Figure 4 PCA projection, follows the figure list.)
  • Figure 4: 100 randomly sampled trajectories from the Franka Kitchen dataset after being clustered into skills and projected to 2D with PCA (from the original 2048 dimensions). Even in 2 dimensions, the clusters are clearly distinguishable. We visualize 2 randomly sampled skills in each cluster, demonstrating that our skill assignment mechanism successfully aligns trajectories performing similar high-level behaviors.
  • Figure 5: EXTRACT outperforms SPiRL and EXTRACT-UVD in RL across all comparisons, demonstrating the advantages of our clustered skill-space. SAC and BC struggle, demonstrating the need for skill-based RL. In LIBERO-{Object, Spatial, Goal}, return is success rate.
  • ...and 14 more figures
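
The skill-learning phase in the Figure 2 caption trains the decoder $p_{a}(\bar{a}\mid z, d)$ with a VAE reconstruction objective alongside the argument prior $p_z(z\mid s, d)$. The sketch below is a hedged reconstruction under standard VAE assumptions; the tiny linear modules, dimensions, and the $\beta$ weight are placeholders, not the paper's architecture or hyperparameters.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

# Toy stand-ins for the encoder q(z|a,d), decoder p_a(a|z,d), and argument
# prior p_z(z|s,d); the real models are sequence networks over action chunks.
A_DIM, Z_DIM, S_DIM, D_DIM = 7, 10, 64, 8
enc = torch.nn.Linear(A_DIM + D_DIM, 2 * Z_DIM)
dec = torch.nn.Linear(Z_DIM + D_DIM, A_DIM)
prior = torch.nn.Linear(S_DIM + D_DIM, 2 * Z_DIM)

def skill_vae_loss(a, d, s, beta=1e-3):
    # Encode the action sequence and skill ID into posterior parameters.
    mu_q, logstd_q = enc(torch.cat([a, d], -1)).chunk(2, -1)
    z = mu_q + torch.randn_like(mu_q) * logstd_q.exp()  # reparameterization trick
    recon = dec(torch.cat([z, d], -1))                  # reconstruct actions from (z, d)
    # Pull q(z|a,d) toward the state-conditioned argument prior p_z(z|s,d).
    mu_p, logstd_p = prior(torch.cat([s, d], -1)).chunk(2, -1)
    kl = kl_divergence(Normal(mu_q, logstd_q.exp()),
                       Normal(mu_p, logstd_p.exp())).mean()
    return F.mse_loss(recon, a) + beta * kl

loss = skill_vae_loss(torch.randn(4, A_DIM),
                      F.one_hot(torch.randint(0, D_DIM, (4,)), D_DIM).float(),
                      torch.randn(4, S_DIM))
loss.backward()
```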
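
The skill label assignment in Figure 3 and the projection in Figure 4 can likewise be sketched in a few lines. This is an illustrative reconstruction, assuming the per-timestep VLM embedding differences have already been computed; k-means stands in for the paper's clustering method, and the cluster count, filter kernel size, and random placeholder data are assumptions.

```python
import numpy as np
from scipy.signal import medfilt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
vlm_diffs = rng.normal(size=(500, 2048))  # placeholder for VLM embedding differences

# (1) Cluster the embedding differences into discrete skill labels.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(vlm_diffs)

# (2) Median-filter the per-timestep labels to smooth noisy assignments;
# an odd window of integer labels round-trips exactly through float.
smoothed = medfilt(labels.astype(float), kernel_size=7).astype(int)

# Figure 4-style view: project the 2048-D differences to 2-D with PCA.
xy = PCA(n_components=2).fit_transform(vlm_diffs)
print(smoothed[:20], xy.shape)
```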