Constrained Meta Agnostic Reinforcement Learning

Karam Daaboul, Florian Kuhm, Tim Joseph, J. Marius Zoellner

TL;DR

C-MAML enables rapid and efficient task adaptation by incorporating task-specific constraints directly into its meta-algorithm framework during the training phase, which results in safer initial parameters for learning new tasks.

Abstract

Meta-Reinforcement Learning (Meta-RL) aims to acquire meta-knowledge for quick adaptation to diverse tasks. However, applying these policies in real-world environments presents a significant challenge in balancing rapid adaptability with adherence to environmental constraints. Our novel approach, Constraint Model Agnostic Meta Learning (C-MAML), merges meta learning with constrained optimization to address this challenge. C-MAML enables rapid and efficient task adaptation by incorporating task-specific constraints directly into its meta-algorithm framework during the training phase. This fusion results in safer initial parameters for learning new tasks. We demonstrate the effectiveness of C-MAML in simulated locomotion with wheeled robot tasks of varying complexity, highlighting its practicality and robustness in dynamic environments.

Paper Structure (30 sections, 37 equations, 13 figures, 1 algorithm)


Figures (13)

  • Figure 1: Visual representation of the Constrained Model Agnostic Meta Learning (C-MAML) framework. This schematic showcases the iterative optimization process where the meta-policy is trained across different tasks. Task-specific policies ($\pi_1, \pi_2, \pi_3$) are adjusted within their respective constraint surfaces $C_1, C_2, C_3$, each with a dedicated safety boundary $d_1, d_2, d_3$.
  • Figure 2: Illustrations of the action space and of two different tasks in the environment used.
  • Figure 3: Evaluation of $\eta$ on policy safety and adaptability: On the left, meta-training performance across 106 tasks, showing the effect of an adaptive $\eta$ (employing safety critic) versus $\eta = 0$ (no safety critic) on maintaining safer cost margins. On the right, fine-tuning phase performance, illustrating how an adaptive $\eta$ contributes to consistent adherence to the $d=5$ cost threshold compared to the absence of a safety critic.
  • Figure 4: Mean episode return and costs during fine-tuning across tasks. Policies are color-coded as follows: C-MAML with TRPOLag in the inner loop is depicted in blue, the randomly initialized policy in orange, the TRPOLag pretrained policy in green, and the MAML policy with TRPO in the inner loop is shown in red, highlighting the diverse adaptation strategies explored.
  • Figure 5: Mean episode return and costs during fine-tuning across tasks: C-MAML with CPO in the inner loop is depicted in blue, the randomly initialized policy in orange, the CPO pretrained policy in green, and the MAML policy with TRPO in the inner loop in red. Each of these initializations is fine-tuned using CPO.
  • ...and 8 more figures
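
The captions above refer to a cost threshold ($d=5$) and to a Lagrangian inner-loop learner (TRPOLag). As a rough illustration of how Lagrangian methods of this kind usually enforce such a threshold, the sketch below shows a projected dual-ascent update of the multiplier and the resulting penalized objective. This is a generic sketch, not the paper's implementation: the function names, the learning rate, and the example cost values are all assumptions made for illustration; only the threshold of 5 comes from the Figure 3 caption.

```python
def update_multiplier(lam, mean_episode_cost, cost_limit=5.0, lr=0.05):
    """One projected dual-ascent step on the Lagrange multiplier.

    The multiplier grows while the measured cost exceeds the limit and
    shrinks otherwise; the max() keeps it non-negative.
    cost_limit=5.0 mirrors the d=5 threshold in the Figure 3 caption;
    lr is a hypothetical step size.
    """
    return max(0.0, lam + lr * (mean_episode_cost - cost_limit))


def penalized_objective(mean_return, mean_episode_cost, lam, cost_limit=5.0):
    """Return minus the multiplier-weighted constraint violation."""
    return mean_return - lam * (mean_episode_cost - cost_limit)


# Hypothetical sequence of mean episode costs observed over four updates.
lam = 0.0
for cost in [9.0, 7.0, 5.0, 3.0]:
    lam = update_multiplier(lam, cost)
```

In this toy run the multiplier rises while the cost is above the limit, stays put when the cost equals the limit, and decays once the cost falls below it, which is the behavior that lets the penalty relax for policies that already satisfy the constraint.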