ProMP: Proximal Meta-Policy Search

Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, Pieter Abbeel

TL;DR

The paper addresses poor credit assignment to pre-adaptation behavior in gradient-based Meta-RL. It offers a formal analysis comparing two meta-learning formulations, arguing that proper pre-update credit assignment (as in Formulation I) aligns pre- and post-update gradients to optimize adaptation. To tackle high-variance gradient estimates, it introduces the Low Variance Curvature (LVC) estimator and builds Proximal Meta-Policy Search (ProMP), which combines LVC with proximal updates and KL-based constraints. Across MuJoCo tasks, ProMP demonstrates superior sample efficiency, faster training, and better asymptotic performance than prior gradient-based Meta-RL methods.

Abstract

Credit assignment in Meta-reinforcement learning (Meta-RL) is still poorly understood. Existing methods either neglect credit assignment to pre-adaptation behavior or implement it naively. This leads to poor sample-efficiency during meta-training as well as ineffective task identification strategies. This paper provides a theoretical analysis of credit assignment in gradient-based Meta-RL. Building on these insights, we develop a novel meta-learning algorithm that overcomes both the issue of poor credit assignment and previous difficulties in estimating meta-policy gradients. By controlling the statistical distance of both pre-adaptation and adapted policies during meta-policy search, the proposed algorithm enables efficient and stable meta-learning. Our approach leads to superior pre-adaptation policy behavior and consistently outperforms previous Meta-RL algorithms in sample-efficiency, wall-clock time, and asymptotic performance.
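
The idea of controlling the statistical distance of both the adapted and the pre-adaptation policy can be illustrated with a small sketch. The NumPy snippet below assumes a PPO-style clipped likelihood-ratio surrogate for the adapted policy and a scalar KL penalty for the pre-adaptation policy. The function name, argument names, and default coefficients are illustrative assumptions, not values or identifiers from the paper, and the inner adaptation step and the LVC estimator are not shown.

```python
import numpy as np


def clipped_surrogate_with_kl(logp_new, logp_old, advantages, kl_pre,
                              clip_eps=0.2, kl_coef=0.1):
    """Hypothetical proximal objective (to be maximized).

    logp_new, logp_old: log-probabilities of the sampled actions under the
        current and the data-collecting adapted policy (1-D arrays).
    advantages: advantage estimates for the same samples (1-D array).
    kl_pre: scalar KL divergence between the old and current
        pre-adaptation policies, assumed to be computed elsewhere.
    clip_eps, kl_coef: placeholder hyperparameters, chosen arbitrarily.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Likelihood ratio between the current and the old adapted policy.
    ratio = np.exp(logp_new - logp_old)
    # Clipping removes the incentive to move the ratio outside the trust region.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages).mean()
    # Penalize drift of the pre-adaptation policy from its previous iterate.
    return surrogate - kl_coef * kl_pre
```

When the two policies coincide and the KL term is zero, the value reduces to the mean advantage; a ratio above `1 + clip_eps` with a positive advantage is capped at the clip boundary.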

Paper Structure

This paper contains 30 sections, 63 equations, 8 figures, 1 table, 1 algorithm.

Figures (8)

  • Figure 1: Stochastic computation graphs of meta-learning formulation I (left) and formulation II (right). The red arrows illustrate the credit assignment from the post-update returns $R'$ to the pre-update policy $\pi_\theta$ through $\nabla_\theta J_{\text{pre}}$. (Deterministic nodes: Square; Stochastic nodes: Circle)
  • Figure 2: Meta-learning curves of ProMP and previous gradient-based meta-learning algorithms in six different MuJoCo environments. ProMP outperforms previous work in all of the environments.
  • Figure 3: Meta-learning curves corresponding to different meta-gradient estimators in conjunction with VPG. The introduced LVC approach consistently outperforms the other estimators.
  • Figure 4: Top: Relative standard deviation of meta-policy gradients. Bottom: Returns in the respective environments throughout the learning process. LVC exhibits less variance in its meta-gradients, which may explain its superior performance compared to DiCE.
  • Figure 5: Exploration patterns of the pre-update policy and post-update exploitation with different update functions. Through its superior credit assignment, the LVC objective learns a pre-update policy that is able to identify the current task and adapt its policy accordingly, successfully reaching the goal (dark green circle).
  • ...and 3 more figures