Toward Finding Strong Pareto Optimal Policies in Multi-Agent Reinforcement Learning

Bang Giang Le, Viet Cuong Ta

TL;DR

MGDA++ is proved to converge to strong Pareto optimal solutions in convex, smooth bi-objective problems, and the results show that it converges efficiently and outperforms the other methods in terms of the optimality of the converged policies.

Abstract

In this work, we study the problem of finding Pareto optimal policies in multi-agent reinforcement learning problems with cooperative reward structures. We show that any algorithm in which each agent optimizes only its own reward is subject to suboptimal convergence. Therefore, to achieve Pareto optimality, agents have to act altruistically by considering the rewards of others. This observation bridges the multi-objective optimization framework and multi-agent reinforcement learning. We first propose a framework for applying the Multiple Gradient Descent Algorithm (MGDA) to learning in multi-agent settings. We further show that standard MGDA is subject to weak Pareto convergence, a problem that is often overlooked in other learning settings but is prevalent in multi-agent reinforcement learning. To mitigate this issue, we propose MGDA++, an improvement of the existing algorithm that properly handles the weakly optimal convergence of MGDA. Theoretically, we prove that MGDA++ converges to strong Pareto optimal solutions in convex, smooth bi-objective problems. We further demonstrate the superiority of MGDA++ in cooperative settings in the Gridworld benchmark. The results highlight that our proposed method converges efficiently and outperforms the other methods in terms of the optimality of the converged policies. The source code is available at https://github.com/giangbang/Strong-Pareto-MARL.

Paper Structure

This paper contains 12 sections, 5 theorems, 23 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

When the objective functions are convex, any Pareto stationary point $\hat{x}$ is a weak Pareto optimal solution.
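
The reasoning behind Lemma 1 can be sketched as follows. The block uses the standard multi-objective definitions under a minimization convention with objectives $F_1, \dots, F_n$; these are assumptions, and the notation may differ from the paper's own Definitions 1 to 4.

```latex
% Pareto stationarity (standard definition, assumed here):
% some convex combination of the gradients vanishes at \hat{x}.
\exists\, \alpha \in \mathbb{R}^n_{\ge 0},\ \sum_{i=1}^n \alpha_i = 1 :
\qquad \sum_{i=1}^n \alpha_i \nabla F_i(\hat{x}) = 0 .

% With each F_i convex, G(x) = \sum_i \alpha_i F_i(x) is convex and
% \nabla G(\hat{x}) = 0, so \hat{x} is a global minimizer of G:
G(x) \ge G(\hat{x}) \quad \text{for all } x .

% Suppose some x improved every objective strictly,
% F_i(x) < F_i(\hat{x}) for all i. Since \alpha \ge 0 and \alpha \ne 0,
G(x) = \sum_{i=1}^n \alpha_i F_i(x) < \sum_{i=1}^n \alpha_i F_i(\hat{x}) = G(\hat{x}),
% which contradicts the line above. Hence no such x exists,
% i.e. \hat{x} is weak Pareto optimal.
```

The argument gives only weak optimality because some $\alpha_i$ may be zero: an objective with zero weight can still be improved without worsening the others, which is the gap between weak and strong Pareto optimality that the abstract refers to.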

Figures (6)

  • Figure 1: Example of the matrix game, the tuple in each table cell contains the reward of agent 1 and 2, respectively. There is one Pareto Optimal solution at $(A, A)$. Note that in this example, any policy profile, stochastic or deterministic, is a Nash Equilibrium.
  • Figure 2: Reward landscape in 2D space.
  • Figure 3: Comparison of MGDA, MGDA with Adam and MGDA++. Left: MGDA gets stuck at Pareto Stationary points, where the learning completely stops. Middle: Changing the optimizer does not help in avoiding the suboptimal convergence. Right: MGDA++ is able to converge to strong Pareto Optimal solutions while avoiding being trapped at Pareto Stationary points.
  • Figure 4: Comparison of MGDA and MGDA++ convergent points. Left: We test MGDA and MGDA++ on the synthetic problem from lin2019pareto. Both algorithms can converge to different Pareto optimal solutions in most usual cases. Right: We initialize the two algorithms with the same starting point in a simple quadratic bi-objective optimization problem, $F_{1, 2}(x)=\|x\pm \mathbf{1}\|^2$. While MGDA stops as soon as it reaches the first stationary point, MGDA++ avoids small gradient norm solutions by further taking an additional step into the relative interior of the Pareto Set. In this example, MGDA++ does not converge to balls with radius $\epsilon/2$ around stationary points whose gradient norms of one of the objectives equal 0. For visualization, we plot such a ball with doubled radius $\epsilon=0.01$ as a blue-dotted empty circle in the figure.
  • Figure 5: Four scenarios of the Gridworld environment.
  • ...and 1 more figure
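
To make the quadratic example from Figure 4 concrete, the following is a minimal sketch of plain two-objective MGDA on $F_{1, 2}(x)=\|x\pm \mathbf{1}\|^2$. It uses the well-known closed-form min-norm combination of two gradients. The function names, starting point, step size, and tolerance are illustrative assumptions and are not taken from the paper or its repository, and the MGDA++ modification is not reproduced here.

```python
# Sketch of plain MGDA for two objectives (not MGDA++).
# F1(x) = ||x + 1||^2 and F2(x) = ||x - 1||^2, both minimized.
# All names and hyperparameters below are hypothetical.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def grads(x):
    """Gradients of F1 and F2 at x."""
    g1 = [2.0 * (xi + 1.0) for xi in x]
    g2 = [2.0 * (xi - 1.0) for xi in x]
    return g1, g2

def min_norm_direction(g1, g2):
    """Min-norm point of the segment {a*g1 + (1-a)*g2 : 0 <= a <= 1}."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        return list(g1), 1.0  # identical gradients: any weight works
    # Unconstrained minimizer of ||a*g1 + (1-a)*g2||^2, clipped to [0, 1].
    alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    alpha = min(1.0, max(0.0, alpha))
    d = [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]
    return d, alpha

def mgda(x0, lr=0.05, tol=1e-8, max_steps=10000):
    """Descend along the min-norm direction until it (nearly) vanishes."""
    x = list(x0)
    for _ in range(max_steps):
        g1, g2 = grads(x)
        d, _ = min_norm_direction(g1, g2)
        if dot(d, d) ** 0.5 < tol:  # Pareto stationary up to tol
            break
        x = [xi - lr * di for xi, di in zip(x, d)]
    return x

x_final = mgda([2.0, -0.5])
```

From the assumed start $(2.0, -0.5)$, the iterate moves onto the line $x_1 = x_2$ and stops near $(0.75, 0.75)$, a point on the segment between $-\mathbf{1}$ and $\mathbf{1}$ where a convex combination of the two gradients vanishes. The stopping rule checks only the norm of the combined direction, which is why plain MGDA halts at the first Pareto stationary point it reaches, as the caption describes.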

Theorems & Definitions (15)

  • Definition 1: Pareto stationary
  • Definition 2
  • Definition 3
  • Definition 4
  • Lemma 1: Theorem 3.3, zeng2019convergence
  • Corollary 2
  • Lemma 3
  • proof
  • Proposition 4
  • proof: Proof of Proposition \ref{prop:small_eps}
  • ...and 5 more