Self-Improving Skill Learning for Robust Skill-based Meta-Reinforcement Learning

Sanghyeon Lee, Sangjun Bae, Yisak Park, Seungyul Han

TL;DR

SISL tackles the instability of skill-based meta-RL under noisy offline data by decoupling a high-level policy from a dedicated skill-improvement policy and by prioritizing updates on task-relevant trajectories through maximum return relabeling. A reward-model-guided relabeling scheme concentrates learning on promising offline samples, while an online improvement process progressively denoises the skill library. Empirical results across Kitchen, Office, Maze2D, and AntMaze show that SISL robustly outperforms existing skill-based meta-RL approaches, especially as demonstration noise increases, with acceptable computational overhead. The work offers a practical, data-efficient path to robust long-horizon meta-learning in real-world settings where data quality cannot be guaranteed.

Abstract

Meta-reinforcement learning (Meta-RL) facilitates rapid adaptation to unseen tasks but faces challenges in long-horizon environments. Skill-based approaches tackle this by decomposing state-action sequences into reusable skills and employing hierarchical decision-making. However, these methods are highly susceptible to noisy offline demonstrations, leading to unstable skill learning and degraded performance. To address this, we propose Self-Improving Skill Learning (SISL), which performs self-guided skill refinement using decoupled high-level and skill improvement policies, while applying skill prioritization via maximum return relabeling to focus updates on task-relevant trajectories, resulting in robust and stable adaptation even under noisy and suboptimal data. By mitigating the effect of noise, SISL achieves reliable skill learning and consistently outperforms other skill-based meta-RL methods on diverse long-horizon tasks.

Paper Structure

This paper contains 41 sections, 13 equations, 21 figures, 10 tables, and 2 algorithms.

Figures (21)

  • Figure 1: Sample trajectories in the Maze2D environment: (a) Noisy demonstrations from the offline dataset, (b) Trajectories explored by the exploration policy near the noisy dataset to uncover useful skills, and (c) Trajectories utilizing refined skills to solve unseen test tasks.
  • Figure 2: Comparison of prior skill learning methods in the microwave-opening task: (a) Learned skills with expert and noisy demonstrations. (b) Meta-RL performance with the learned skills.
  • Figure 3: The SISL framework
  • Figure 4: Illustration of maximum return relabeling
  • Figure 5: Considered long-horizon environments
  • ...and 16 more figures