Reset-Free Lifelong Learning with Skill-Space Planning

Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch

TL;DR

This paper tackles reset-free lifelong reinforcement learning in non-stationary environments by proposing Lifelong Skill Planning (LiSP), which learns a diverse set of latent skills and plans over them using a learned dynamics model and Model Predictive Control. LiSP blends online policy optimization with offline skill discovery, aided by a skill-practice curriculum and an intrinsic reward derived from mutual information (DADS), and extends offline learning via a model-disagreement penalty to stay within dataset support. The approach enables long-horizon reasoning and safe acting, demonstrated on non-episodic gridworlds and MuJoCo tasks, where LiSP outperforms strong baselines and reduces failures due to sink states. Additionally, LiSP supports learning skills entirely from offline data and offers insights into when planning in skill space provides advantages over planning in action space. Overall, LiSP provides a scalable, planning-driven framework for robust, reset-free lifelong adaptation with potential for offline data exploitation in complex, changing environments.
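To make the model-disagreement penalty concrete, here is a minimal sketch of how such a penalty could be computed from an ensemble of dynamics models. The function names, the choice of mean distance from the ensemble mean as the disagreement measure, and the coefficient `kappa` are all assumptions made for illustration; the summary above does not specify the exact form LiSP uses.

```python
import numpy as np

def ensemble_disagreement(predictions):
    """Mean L2 distance of each model's next-state prediction from the ensemble mean.

    predictions: array-like of shape (n_models, state_dim).
    This particular disagreement measure is an illustrative assumption.
    """
    preds = np.asarray(predictions, dtype=float)
    mean_pred = preds.mean(axis=0)
    return float(np.linalg.norm(preds - mean_pred, axis=1).mean())

def penalized_reward(reward, predictions, kappa=1.0):
    """Subtract a disagreement penalty so that rollouts the ensemble is
    unsure about (likely outside dataset support) look less attractive.
    kappa is a hypothetical penalty coefficient."""
    return reward - kappa * ensemble_disagreement(predictions)
```

When every model in the ensemble predicts the same next state, the penalty vanishes and the reward is unchanged; the more the models diverge, the more the reward is reduced.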

Abstract

The objective of lifelong reinforcement learning (RL) is to optimize agents which can continuously adapt and interact in changing environments. However, current RL approaches fail drastically when environments are non-stationary and interactions are non-episodic. We propose Lifelong Skill Planning (LiSP), an algorithmic framework for non-episodic lifelong RL based on planning in an abstract space of higher-order skills. We learn the skills in an unsupervised manner using intrinsic rewards and plan over the learned skills using a learned dynamics model. Moreover, our framework permits skill discovery even from offline data, thereby reducing the need for excessive real-world interactions. We demonstrate empirically that LiSP successfully enables long-horizon planning and learns agents that can avoid catastrophic failures even in challenging non-stationary and non-episodic environments derived from gridworld and MuJoCo benchmarks.
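The idea of planning over skills with a learned dynamics model can be sketched on a toy problem. The example below is not the authors' implementation: it uses simple random-shooting Model Predictive Control in place of whatever planner LiSP employs, and the four discrete skills, `skill_dynamics`, `reward_fn`, and the 2D point environment are hypothetical stand-ins for a learned skill policy and dynamics model.

```python
import numpy as np

# Hypothetical stand-in: 4 discrete skills, each moving a 2D point by one unit.
SKILL_DIRECTIONS = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])

def skill_dynamics(state, skill):
    """Toy substitute for a learned model: predicted state after one skill."""
    return state + SKILL_DIRECTIONS[skill]

def reward_fn(state, goal):
    """Negative Euclidean distance to the goal (an assumed task reward)."""
    return -np.linalg.norm(state - goal)

def plan_skill(state, goal, horizon=5, n_candidates=512, rng=None):
    """Random-shooting planner in skill space: sample skill sequences,
    roll them out in the model, and return the first skill of the best one."""
    rng = np.random.default_rng(0) if rng is None else rng
    goal = np.asarray(goal, dtype=float)
    candidates = rng.integers(0, len(SKILL_DIRECTIONS), size=(n_candidates, horizon))
    returns = np.zeros(n_candidates)
    for i, seq in enumerate(candidates):
        s = np.array(state, dtype=float)
        for z in seq:
            s = skill_dynamics(s, z)
            returns[i] += reward_fn(s, goal)
    return int(candidates[np.argmax(returns)][0])

def run_mpc(start, goal, n_steps=10, seed=0):
    """Replan at every step and execute only the first planned skill."""
    rng = np.random.default_rng(seed)
    state = np.array(start, dtype=float)
    for _ in range(n_steps):
        z = plan_skill(state, goal, rng=rng)
        # In a real agent this would be an environment step under the skill
        # policy; here the toy model doubles as the environment.
        state = skill_dynamics(state, z)
    return state
```

Calling `run_mpc([0.0, 0.0], [3.0, 0.0])` moves the point toward the goal by replanning after each executed skill. Because each planning step covers a whole skill rather than a single low-level action, a short planning horizon in skill space can correspond to a much longer horizon in environment steps.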

Paper Structure

This paper contains 30 sections, 5 equations, 13 figures, 2 tables, 3 algorithms.

Figures (13)

  • Figure 1: RL without planning fails without resets. Each line is one seed. The red line shows reward with no updates (i.e. frozen weights).
  • Figure 2: Schematic for Lifelong Skill Planning (LiSP). LiSP learns a set of skills using synthetic model rollouts and performs long-horizon planning in the skill-space for stable, safe lifelong acting.
  • Figure 3: Learning without resets. Vertical lines denote task changes. Each blue line represents one seed of LiSP out of 5; the other algorithms have lower variance (since they fail), so we only show the mean of 3 seeds. Performance is normalized against 1 (for more details, see Appendix \ref{sec:performance}).
  • Figure 4: 2D Volcano environment.
  • Figure 5: $\ell_2$ error in the next state prediction of the dynamics model $f_\phi$ for actions sampled from the skill policy vs. uniformly at random. The error variance is greatly reduced by only using skills.
  • ...and 8 more figures