Meta-Reinforcement Learning with Universal Policy Adaptation: Provable Near-Optimality under All-task Optimum Comparator

Siyuan Xu, Minghui Zhu

TL;DR

BO-MRL is a bilevel optimization framework for meta-RL that learns a meta-prior for task-specific policy adaptation. It runs multiple policy-optimization steps on data collected once, and comes with upper bounds on the expected optimality gap over the task distribution.

Abstract

Meta-reinforcement learning (meta-RL) has attracted attention because it can improve the data efficiency and generalizability of reinforcement learning (RL) algorithms. In this paper, we develop a bilevel optimization framework for meta-RL (BO-MRL) that learns a meta-prior for task-specific policy adaptation and implements multiple-step policy optimization on one-time data collection. Going beyond existing meta-RL analyses, we provide upper bounds on the expected optimality gap over the task distribution. This metric measures the distance between the policy adapted from the learned meta-prior and the task-specific optimum, and it quantifies the model's generalizability to the task distribution. We empirically validate the derived upper bounds and demonstrate that the proposed algorithm outperforms the benchmarks.
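
As a rough illustration of the metric described in the abstract, the expected optimality gap could be written as below. This is a sketch under assumed notation, not the paper's own definition: $J_\tau(\pi)$ (the expected accumulated reward of policy $\pi$ on task $\tau$), $\pi^{*}_{\tau}$ (an optimal policy for task $\tau$), and $\mathbb{P}(\Gamma)$ (the task distribution) are symbols introduced here for illustration. $\mathcal{A} l g(\pi_\phi, \lambda, \tau)$ denotes the policy adapted to task $\tau$ from the meta-prior $\pi_\phi$, as in Proposition 1.

$$
\mathrm{Gap}(\phi) \;=\; \mathbb{E}_{\tau \sim \mathbb{P}(\Gamma)}\Big[\, J_\tau\big(\pi^{*}_{\tau}\big) \;-\; J_\tau\big(\mathcal{A} l g(\pi_\phi, \lambda, \tau)\big) \Big]
$$

Under this reading, the gap is nonnegative for every task because $\pi^{*}_{\tau}$ is optimal for $\tau$, and a smaller value means the adapted policies are, on average, closer to each task's own optimum.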

Paper Structure

This paper contains 43 sections, 30 theorems, 235 equations, 5 figures, 1 table, 2 algorithms.

Key Result

Proposition 1

For the tabular policy in the discrete state-action space, consider any meta-parameter $\phi$ and the within-task algorithm (dis_withintask). Let $\pi_{\theta^{\prime}_{\tau}}=\mathcal{A} l g(\pi_\phi, \lambda, \tau)$. If $M(s)\triangleq\lambda\nabla_{\pi(\cdot | s)}^{2} d^2(\pi_\phi(\cdot|s),\pi(\c

Figures (5)

  • Figure 1: Meta-test results on Frozen Lake, where $\mathcal{A} l g^{(1)}$ is applied. Left: average accumulated reward across all test tasks vs. number of policy adaptation steps. Right: expected optimality gap of BO-MRL and the baselines, compared with the upper bound of the accumulated reward of one-time $\mathcal{A} l g^{(1)}$.
  • Figure 2: Average accumulated reward across all test tasks during the meta-test under the practical algorithm of BO-MRL on the locomotion tasks.
  • Figure 3: Meta-test results of BO-MRL on Frozen Lake, where $\mathcal{A} l g^{(2)}$ is applied. Left: average accumulated reward across all test tasks vs. number of policy adaptation steps. Right: expected optimality gap of BO-MRL and the baselines, compared with the upper bound of the accumulated reward of one-time $\mathcal{A} l g^{(2)}$.
  • Figure 4: Results of BO-MRL on Frozen Lake, where $\mathcal{A} l g^{(3)}$ is applied: expected optimality gap of BO-MRL and the baselines, compared with the upper bound of the accumulated reward of one-time $\mathcal{A} l g^{(3)}$.
  • Figure 5: Accumulated rewards during meta-training under the practical algorithm of BO-MRL on the locomotion tasks.

Theorems & Definitions (59)

  • Proposition 1: Hypergradient for the tabular policy
  • Proposition 2: Hypergradient for the policy with function approximation
  • Remark 1
  • Proposition 3: Existence of hypergradient for the policy with function approximation
  • Theorem 1: Convergence guarantee for tabular softmax policy
  • Theorem 2: Convergence guarantee for softmax policy with function approximation
  • Lemma 1
  • Lemma 2
  • Theorem 3: Optimality guarantee for softmax tabular policy
  • Theorem 4: Optimality guarantee for softmax policy with function approximation
  • ...and 49 more