Policy Optimization and Multi-agent Reinforcement Learning for Mean-variance Team Stochastic Games

Junkai Hu, Li Xia

TL;DR

The paper addresses risk-sensitive cooperative multi-agent planning under mean-variance objectives in infinite-horizon stochastic games. It develops a sensitivity-based optimization framework that yields explicit performance difference and performance derivative formulas. These formulas enable sequential, monotone policy updates via MV-MAPI and an analysis of the local geometry of stationary points. To handle large-scale problems with unknown environment parameters, it extends trust-region policy optimization to MV-TSGs (MV-MATRPO), using per-agent surrogates and a lower bound on the improvement of the joint policy. Experiments on energy management in multi-microgrid systems demonstrate mean-variance trade-offs under different values of $\beta$, confirm monotonic improvement, and highlight how the update order and initial policies affect which local optimum is reached. Overall, the work provides the first theoretically grounded, scalable algorithms with guarantees for risk-sensitive cooperative MARL and shows practical impact in distributed energy management settings.

Abstract

We study a long-run mean-variance team stochastic game (MV-TSG), where each agent shares a common mean-variance objective for the system and takes actions independently to maximize it. MV-TSG has two main challenges. First, the variance metric is neither additive nor Markovian in a dynamic setting. Second, simultaneous policy updates of all agents lead to a non-stationary environment for each individual agent. Both challenges make dynamic programming inapplicable. In this paper, we study MV-TSGs from the perspective of sensitivity-based optimization. The performance difference and performance derivative formulas for joint policies are derived, which provide optimization information for MV-TSGs. We prove the existence of a deterministic Nash policy for this problem. Subsequently, we propose a Mean-Variance Multi-Agent Policy Iteration (MV-MAPI) algorithm with a sequential update scheme, where individual agent policies are updated one by one in a given order. We prove that the MV-MAPI algorithm converges to a first-order stationary point of the objective function. By analyzing the local geometry of stationary points, we derive specific conditions for stationary points to be (local) Nash equilibria, and further, strict local optima. To solve large-scale MV-TSGs in scenarios with unknown environmental parameters, we extend the idea of trust region methods to MV-MAPI and develop a multi-agent reinforcement learning algorithm named Mean-Variance Multi-Agent Trust Region Policy Optimization (MV-MATRPO). We derive a performance lower bound for each update of joint policies. Finally, numerical experiments on energy management in multiple microgrid systems are conducted.
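The two ideas in the abstract, a long-run mean-variance objective and a one-agent-at-a-time update, can be illustrated on a toy problem. The sketch below is not the paper's MV-MAPI algorithm. It assumes the objective has the common form "steady-state mean reward minus $\beta$ times steady-state reward variance" (the exact definition is not reproduced in this summary), and it swaps in brute-force enumeration of each agent's deterministic policies for the paper's policy-improvement step. All names (`evaluate`, `sequential_best_response`, `toy_game`) and the toy transition and reward numbers are hypothetical.

```python
from itertools import product
from typing import List, Tuple


def stationary_distribution(P: List[List[float]], iters: int = 2000) -> List[float]:
    """Power iteration; assumes the chain induced by the joint policy is ergodic."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[s] * P[s][t] for s in range(n)) for t in range(n)]
    return pi


def mean_variance(P, r, beta) -> Tuple[float, float, float]:
    """Return (mean - beta * variance, mean, variance) under the stationary law."""
    pi = stationary_distribution(P)
    mean = sum(p * x for p, x in zip(pi, r))
    var = sum(p * (x - mean) ** 2 for p, x in zip(pi, r))
    return mean - beta * var, mean, var


def evaluate(joint, trans, reward, beta) -> float:
    """Score a joint deterministic policy (one tuple of per-state actions per agent)."""
    acts = [tuple(mu[s] for mu in joint) for s in range(len(reward))]
    P = [trans[s][a] for s, a in enumerate(acts)]
    r = [reward[s][a] for s, a in enumerate(acts)]
    return mean_variance(P, r, beta)[0]


def sequential_best_response(joint, trans, reward, beta, n_actions, order):
    """Update agents one by one in `order`, keeping a change only if it strictly helps.

    Assumption: exhaustive search over each agent's deterministic policies stands in
    for the paper's improvement step; this is only feasible for tiny problems.
    """
    joint = list(joint)
    n_states = len(reward)
    improved = True
    while improved:
        improved = False
        for i in order:
            best = evaluate(joint, trans, reward, beta)
            for cand in product(range(n_actions), repeat=n_states):
                trial = joint[:i] + [cand] + joint[i + 1:]
                val = evaluate(trial, trans, reward, beta)
                if val > best + 1e-12:  # strict improvement only
                    best, joint, improved = val, trial, True
    return joint, evaluate(joint, trans, reward, beta)


def toy_game():
    """Hypothetical 2-state, 2-agent, 2-action team game with a shared reward."""
    trans, reward = [], []
    for s in range(2):
        t, r = {}, {}
        for a in product(range(2), repeat=2):
            p = 0.2 + 0.15 * (a[0] + a[1]) + 0.1 * s  # probability of moving to state 0
            t[a] = [p, 1.0 - p]
            r[a] = float(s + 2 * a[0] + a[1] * (1 - s))
        trans.append(t)
        reward.append(r)
    return trans, reward


if __name__ == "__main__":
    trans, reward = toy_game()
    start = [(0, 0), (0, 0)]
    print("initial objective:", evaluate(start, trans, reward, beta=1.0))
    print("after sequential updates:",
          sequential_best_response(start, trans, reward, 1.0, 2, order=[0, 1]))
```

Because an agent's policy is replaced only when the objective strictly increases, the objective never decreases during the sweep, and the loop stops at a joint policy from which no single agent can gain by a unilateral deterministic deviation. Changing `order` or the starting policy may lead to a different end point in this sketch.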

Paper Structure

This paper contains 29 sections, 13 theorems, 65 equations, 6 figures, 3 tables, 3 algorithms.

Key Result

Lemma 1

For any two joint policies $\bm{\mu}$, $\bm{\mu}' \in \mathcal{U}$, we have

Figures (6)

  • Figure 1: The optimality of the joint policy obtained by Algorithm \ref{alg:mod_MVMAPI}.
  • Figure 2: Architecture of a grid-connected multiple microgrids system.
  • Figure 3: The convergence procedure of Algorithm 1 under different values of $\beta$.
  • Figure 4: The convergence results under different initial policies or update orders when $\beta=1.0$.
  • Figure 5: Scenario 1: the training curves of Algorithm 2 under different values of $\beta$.
  • ...and 1 more figure

Theorems & Definitions (21)

  • Lemma 1: Performance Difference Formula for MV-TSGs
  • Lemma 2: Performance Derivative Formula for MV-TSGs
  • Definition 1: Local Nash Equilibrium
  • Remark 1
  • Theorem 1
  • Definition 2: First-order Stationary Point
  • Remark 2
  • Theorem 3
  • Corollary 1
  • Definition 3
  • ...and 11 more