Hierarchical Meta-Reinforcement Learning via Automated Macro-Action Discovery

Minjae Cho, Chuangchuang Sun

TL;DR

Meta-RL struggles with fast adaptation across complex, high-dimensional tasks. The authors propose HiMeta, a tri-level hierarchy that learns task representations, discovers task-agnostic macro-actions via a modified VAE bridging states to goals, and learns primitive actions with PPO, with independent training to avoid the curse of hierarchy. The approach reduces the information load on the policy and enables re-use of macro-actions across tasks, achieving superior sample efficiency and higher success rates on MetaWorld ML10 compared with SD and PEARL. This work advances scalable, few-shot adaptation for complex multi-task settings and suggests future extensions to offline, safe, and multi-modal RL.

Abstract

Meta-Reinforcement Learning (Meta-RL) enables fast adaptation to new testing tasks. Despite recent advancements, it is still challenging to learn performant policies across multiple complex and high-dimensional tasks. To address this, we propose a novel architecture with three hierarchical levels for 1) learning task representations, 2) discovering task-agnostic macro-actions in an automated manner, and 3) learning primitive actions. The macro-action can guide the low-level primitive policy learning to more efficiently transition to goal states. This can address the issue that the policy may forget previously learned behavior while learning new, conflicting tasks. Moreover, the task-agnostic nature of the macro-actions is enabled by removing task-specific components from the state space. Hence, this makes them amenable to re-composition across different tasks and leads to promising fast adaptation to new tasks. Also, the prospective instability from the tri-level hierarchies is effectively mitigated by our innovative, independently tailored training schemes. Experiments in the MetaWorld framework demonstrate the improved sample efficiency and success rate of our approach compared to previous state-of-the-art methods.

Paper Structure

This paper contains 8 sections, 21 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of our approach. Our higher-level architecture, HiMeta, provides high-level directional action predictions, from which the policy generates primitive actions. The directional action prediction is trained using a modified VAE designed to link the current state to the desired goal state. This distributes the decision-making load across levels, facilitating solutions for complex, high-dimensional multi-task and meta-learning scenarios.
  • Figure 2: Our algorithm consists of three hierarchical layers: high (task representation learning), intermediate (macro-action discovery), and low (primitive action discovery). The high-level layer learns a task representation $y$ using a recurrent unit, trained simultaneously with the value function. The intermediate layer takes the representation and the current state and determines the macro-action $z$, which is the (+/-) sign of actions. This layer is trained as a VAE using the missing-information technique. The state element $s^{ego}$ belongs to $\mathcal{S}^{ego} \subset \mathcal{S}$, a subset of the state space that includes everything except the agent's self-state. The resulting decoder loss with $s^{ego}$ shapes the macro-actions $z$ to be task-agnostic and compact. The low-level policy is then conditioned on the representation and the macro-action to make primitive decisions. Gradients do not flow between hierarchical layers, ensuring each layer's independent role.
  • Figure 3: Reward shaping function with $a=3$. Rewards below 3 are exponential, while the gradient is linear beyond $a=3$.
  • Figure : (a) Success Metric
  • Figure : (a) Success Metric
  • ...and 1 more figure
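
The Figure 2 caption describes the macro-action $z$ as the (+/-) sign of actions. As a minimal sketch of that idea only, the snippet below derives a per-dimension sign label from a primitive action vector. The function name `macro_action_from_primitive` is hypothetical, and mapping zero entries to +1 is an assumption made here for illustration. In the paper the macro-action is predicted by the intermediate VAE layer, which this sketch does not model.

```python
import numpy as np


def macro_action_from_primitive(action: np.ndarray) -> np.ndarray:
    """Return the per-dimension sign (+1 or -1) of a primitive action.

    Assumption (not from the paper): zero entries are treated as +1,
    so that every dimension receives a well-defined direction.
    """
    return np.where(action >= 0.0, 1, -1).astype(np.int8)


# Hypothetical 4-dimensional primitive action.
action = np.array([0.4, -0.2, 0.0, -1.0])
z = macro_action_from_primitive(action)
print(z)
```

A sign label of this kind keeps only the direction of each action dimension and discards its magnitude, which is consistent with the caption's description of the macro-action as compact.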

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Remark 1