Multi-Agent Reinforcement Learning via Distributed MPC as a Function Approximator

Samuel Mallick, Filippo Airaldi, Azita Dabiri, Bart De Schutter

TL;DR

The paper addresses multi-agent reinforcement learning (MARL) for linear systems with convex polytopic constraints by using a structured distributed MPC scheme as the function approximator for the policy and value functions. It develops a distributed Q-learning framework in which the alternating direction method of multipliers (ADMM) and global average consensus (GAC) let agents evaluate the MPC scheme and compute learning updates without a central coordinator. A key theoretical result relates the local ADMM dual variables to the dual variables of the centralized problem, so that per-agent updates reproduce the centralized learning update while preserving privacy. Numerical results on an academic chain example and a power-system network show performance comparable to the centralized method and robust constraint satisfaction under model uncertainty. The work points to the potential of combining distributed optimization with MPC-based RL for safe, interpretable, and scalable MARL in networked systems.

Abstract

This paper presents a novel approach to multi-agent reinforcement learning (RL) for linear systems with convex polytopic constraints. Existing work on RL has demonstrated the use of model predictive control (MPC) as a function approximator for the policy and value functions. The current paper is the first work to extend this idea to the multi-agent setting. We propose the use of a distributed MPC scheme as a function approximator, with a structure allowing for distributed learning and deployment. We then show that Q-learning updates can be performed distributively without introducing nonstationarity, by reconstructing a centralized learning update. The effectiveness of the approach is demonstrated on two numerical examples.
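
As a rough illustration of how a centralized quantity can be reconstructed from local ones, the sketch below runs an average-consensus iteration over a small chain of agents. It is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the centralized update depends on a sum of per-agent scalar terms, it uses Metropolis weights (a standard choice that the source does not specify), and the names `average_consensus`, `neighbors`, and `local_terms` as well as the numbers are hypothetical.

```python
def average_consensus(values, neighbors, iterations=200):
    """Each agent repeatedly mixes its value with its neighbors' values.

    Metropolis weights (an assumption, not taken from the paper) need only
    an agent's own degree and its neighbors' degrees, so no central
    coordinator is involved.
    """
    x = list(values)
    degree = [len(nbrs) for nbrs in neighbors]
    for _ in range(iterations):
        new_x = []
        for i, nbrs in enumerate(neighbors):
            xi = x[i]
            for j in nbrs:
                w = 1.0 / (1 + max(degree[i], degree[j]))
                xi += w * (x[j] - x[i])
            new_x.append(xi)
        x = new_x  # synchronous update: all agents step together
    return x


# Hypothetical three-agent chain 0 - 1 - 2 with illustrative local terms.
neighbors = [[1], [0, 2], [1]]
local_terms = [0.4, -0.1, 0.3]

estimates = average_consensus(local_terms, neighbors)

# Every agent scales its estimate of the mean by the number of agents to
# recover the global sum that a centralized learner would have used.
global_sum_per_agent = [len(local_terms) * e for e in estimates]
```

Because every agent ends up with the same global sum, each one can apply the same update to its own parameters, which is the sense in which a centralized update is reconstructed locally in this sketch.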

Paper Structure

This paper contains 19 sections, 41 equations, 8 figures, 1 algorithm.

Figures (8)

  • Figure 1: Accuracy of the dual variables recovered by Proposition \ref{prop:duals} as a function of the ADMM iteration index $\tau$.
  • Figure 2: Centralized (left) and distributed (right). Evolution of the states and inputs during training. Agent 1 (blue), agent 2 (red), agent 3 (purple), and bounds (dashed).
  • Figure 3: Evolution of TD errors (top) and stage costs (bottom) during training.
  • Figure 4: Evolution of learnable parameters for agent $2$ during training. Distributed (blue) and centralized (red).
  • Figure 5: NMPC and SMPC compared against policy at training time steps $t$ (log scale).
  • ...and 3 more figures