Natural Policy Gradient and Actor Critic Methods for Constrained Multi-Task Reinforcement Learning

Sihan Zeng, Thinh T. Doan, Justin Romberg

TL;DR

This work introduces constrained multi-task reinforcement learning (RL), formalizing the problem as maximizing the average task return $V_0^{\pi}(\rho)$ under per-task bounds $\ell_i\le V_i^{\pi}(\rho)\le u_i$ within both centralized and decentralized settings. It develops a family of primal-dual, policy-gradient-based algorithms: a centralized MT-PDNPG with exact gradients achieving ${\cal O}(K^{-1/2})$ convergence, and a fully online MT-PDNAC that attains ${\cal O}(K^{-1/6})$ with a single trajectory; these extend to decentralized graphs with consensus and similar rates influenced by the graph’s spectral gap. To handle large or continuous state spaces, the authors extend to linear function approximation via a nested-loop architecture that preserves the ${\widetilde{\cal O}}(\delta^{-6})$ rate up to approximation error, and provide finite-sample guarantees under standard mixing assumptions. Numerical experiments on a three-task GridWorld demonstrate the practical ability to enforce per-task constraints while balancing overall performance, validating the proposed methods as scalable and online-friendly solutions for constrained multi-task RL.
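
For reference, the constrained problem summarized above can be written out explicitly. This is a sketch in assumed notation: the symbol $N$ for the number of tasks and the multipliers $\lambda_{1,i},\lambda_{2,i}\ge 0$ are introduced here for illustration and may differ from the paper's own symbols.

$$
\max_{\pi}\; V_0^{\pi}(\rho)=\frac{1}{N}\sum_{i=1}^{N}V_i^{\pi}(\rho)
\quad\text{subject to}\quad \ell_i\le V_i^{\pi}(\rho)\le u_i,\quad i=1,\dots,N.
$$

A primal-dual method for a problem of this form would typically work with a Lagrangian such as

$$
L(\pi,\lambda_1,\lambda_2)=V_0^{\pi}(\rho)+\sum_{i=1}^{N}\lambda_{1,i}\big(V_i^{\pi}(\rho)-\ell_i\big)+\sum_{i=1}^{N}\lambda_{2,i}\big(u_i-V_i^{\pi}(\rho)\big),
$$

which is maximized over the policy $\pi$ and minimized over the nonnegative multipliers.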

Abstract

Multi-task reinforcement learning (RL) aims to find a single policy that effectively solves multiple tasks at the same time. This paper presents a constrained formulation for multi-task RL where the goal is to maximize the average performance of the policy across tasks subject to bounds on the performance in each task. We consider solving this problem both in the centralized setting, where information for all tasks is accessible to a single server, and in the decentralized setting, where a network of agents, each given one task and observing local information, cooperate to find the solution of the globally constrained objective using local communication. We first propose a primal-dual algorithm that provably converges to the globally optimal solution of this constrained formulation under exact gradient evaluations. When the gradient is unknown, we further develop a sample-based actor-critic algorithm that finds the optimal policy using online samples of state, action, and reward. Finally, we study the extension of the algorithm to the linear function approximation setting.
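
To make the primal-dual idea concrete, here is a minimal Python sketch of a generic primal-dual natural policy gradient loop with exact values. It is not the paper's MT-PDNPG. Everything in it is an assumption made for illustration: the single-state MDP (so each task value is available in closed form), the softmax policy, the step sizes, the dual cap `lam_max`, and the function name `primal_dual_npg`.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def clip(x, lo, hi):
    return max(lo, min(hi, x))

def primal_dual_npg(rewards, lower, upper, gamma=0.5, eta_pi=0.05,
                    eta_lam=0.05, lam_max=20.0, iters=4000):
    """Hypothetical sketch: rewards[i][a] is the reward of action a in task i."""
    n_tasks, n_actions = len(rewards), len(rewards[0])
    theta = [0.0] * n_actions      # softmax logits of the shared policy
    lam_lo = [0.0] * n_tasks       # multipliers for V_i >= lower[i]
    lam_hi = [0.0] * n_tasks       # multipliers for V_i <= upper[i]
    avg_values = [0.0] * n_tasks   # running average of per-task values
    for _ in range(iters):
        pi = softmax(theta)
        # Exact values: with a single state, V_i = <pi, r_i> / (1 - gamma).
        values = [sum(p * r for p, r in zip(pi, r_i)) / (1 - gamma)
                  for r_i in rewards]
        for i, v in enumerate(values):
            avg_values[i] += v / iters
        # Dual step: projected gradient descent on the Lagrangian.
        for i, v in enumerate(values):
            lam_lo[i] = clip(lam_lo[i] - eta_lam * (v - lower[i]), 0.0, lam_max)
            lam_hi[i] = clip(lam_hi[i] - eta_lam * (upper[i] - v), 0.0, lam_max)
        # Reward of the Lagrangian: task average plus multiplier-weighted rewards.
        weights = [1.0 / n_tasks + lam_lo[i] - lam_hi[i] for i in range(n_tasks)]
        r_lam = [sum(w * rewards[i][a] for i, w in enumerate(weights))
                 for a in range(n_actions)]
        # Primal step: softmax NPG, theta += eta / (1 - gamma) * advantage.
        # In a single-state MDP the advantage is r(a) - <pi, r>.
        baseline = sum(p * r for p, r in zip(pi, r_lam))
        theta = [t + eta_pi / (1 - gamma) * (r - baseline)
                 for t, r in zip(theta, r_lam)]
    return softmax(theta), lam_lo, lam_hi, avg_values

# Two toy tasks, three actions; the lower bound on task 1 forces a mixed policy.
toy_rewards = [[1.0, 0.0, 0.6], [0.0, 1.0, 0.6]]
policy, lam_lo, lam_hi, avg_values = primal_dual_npg(
    toy_rewards, lower=[1.5, 0.3], upper=[1.9, 1.9])
```

In this toy problem the unconstrained optimum (always playing the third action) violates the lower bound on the first task, so the multiplier for that bound grows until the policy shifts weight toward the first action. The sketch reports values averaged over iterates because the last iterate of a primal-dual scheme can oscillate around the saddle point.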

Paper Structure (35 sections, 20 theorems, 164 equations, 2 figures, 5 algorithms)

Key Result

Lemma 1

Under the Slater condition (Assumption assump:Slater), we have [...], where $B_{\lambda}=\frac{1}{\xi(1-\gamma)}$.

Figures (2)

  • Figure 1: Reward in maze 1 (left): 10 for reaching target, -5 for crossing the second and third bridge, -1 for crossing the fourth bridge, and -0.1 for any other move. Reward in maze 2 (middle): 100 for reaching target, -50 for crossing the first and third bridge, -10 for crossing the fourth bridge, and -1 for any other move. Reward in maze 3 (right): 1000 for reaching target, -500 for crossing the second and the third bridge, -100 for crossing the fourth bridge, and -10 for any other move. Green dotted lines indicate optimal paths for local tasks. The yellow dotted line indicates a sub-optimal but acceptable policy for each local task, which is also the globally optimal policy of the constrained multi-task problem under $\ell_1=5$, $\ell_2=50$, $\ell_3=500$.
  • Figure 2: Left -- convergence of the MT-PDNPG algorithm in objective function with $\ell_1=5$, $\ell_2=50$, $\ell_3=500$ (constrained policy), and with $\ell_1=\ell_2=\ell_3=-\infty$ (unconstrained policy). Middle -- convergence of the MT-PDNPG algorithm in constraint violation with $\ell_1=5$, $\ell_2=50$, $\ell_3=500$. Right -- convergence of the MT-PDNPG algorithm in constraint violation with $\ell_1=\ell_2=\ell_3=-\infty$.

Theorems & Definitions (22)

  • Lemma 1
  • Corollary 1
  • Definition 1
  • Corollary 2
  • Theorem 1
  • Theorem 2
  • Remark 1
  • Theorem 3
  • Proposition 1
  • Lemma 2: Lemma 8 of khodadadian2022finite
  • ...and 12 more