Second-Order MPC-Based Distributed Q-Learning

Samuel Mallick, Filippo Airaldi, Azita Dabiri, Bart De Schutter

TL;DR

This paper accelerates distributed MPC-based Q-learning for multi-agent systems with privacy constraints by introducing a second-order update. It derives a distributed Hessian-informed update whose global computation decomposes into per-agent calculations using consensus on a structured gradient/Hessian term. The global step solves $(\bm{H}+\bm{\Lambda})\bm{d}=\bm{q}$ with $\bm{q} = -\frac{1}{T}\textstyle\sum_{\tau} \boldsymbol{\delta} \nabla_\theta Q_\theta(s_\tau,a_\tau)$ and $\bm{H} = \frac{1}{T}\textstyle\big(\nabla_\theta Q_\theta \nabla_\theta Q_\theta^\top - \nabla_\theta^2 Q_\theta\big)$. Through consensus on a matrix $\bm{C}$, the update becomes $\bm{d}_i = -\tilde{\bm{K}}_i \bm{G}_i (\boldsymbol{\delta} - (\bm{I}+\bm{C})^{-1}\bm{C}\boldsymbol{\delta})$, enabling fully distributed computation. Simulations on a three-agent network show that the distributed second-order method matches centralized second-order performance and outperforms the first-order variant, with communication scaling as $O(T^2)$ and remaining independent of the network size $M$. These results indicate significantly faster and more stable learning for distributed MPC-based RL, while preserving locality and privacy.
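To make the structure of the per-agent update more concrete, the following is a minimal NumPy sketch, not the paper's algorithm. It assumes a Gauss-Newton simplification (the $\nabla_\theta^2 Q_\theta$ term in $\bm{H}$ is dropped), a diagonal $\bm{\Lambda}$ that splits across agents, and a gradient matrix whose rows are partitioned by agent. Under those assumptions the push-through matrix identity yields a $T \times T$ matrix that is a sum of local terms, which is consistent with the $O(T^2)$ communication noted above. The function names, the choice $\tilde{\bm{K}}_i = \bm{\Lambda}_i^{-1}/T$, and the plain sum standing in for consensus are all illustrative assumptions; the paper's definitions of $\tilde{\bm{K}}_i$, $\bm{G}_i$, and $\bm{C}$ may differ.

```python
import numpy as np


def centralized_step(G, delta, lam):
    """Solve (H + Lambda) d = q for the full parameter vector.

    G     : (n, T) array, column tau is the gradient of Q w.r.t. theta at sample tau
    delta : (T,) array of TD errors
    lam   : (n,) positive diagonal of Lambda (assumed diagonal in this sketch)
    """
    T = G.shape[1]
    H = G @ G.T / T          # Gauss-Newton part only; second-derivative term omitted
    q = -G @ delta / T
    return np.linalg.solve(H + np.diag(lam), q)


def distributed_step(G_blocks, delta, lam_blocks):
    """Per-agent form of the same step under the assumptions stated above.

    G_blocks   : list of (n_i, T) arrays, the rows of G owned by each agent
    lam_blocks : list of (n_i,) arrays, each agent's part of the diagonal of Lambda
    """
    T = delta.shape[0]
    # Each agent can form its own T x T term locally; the sum below is a
    # stand-in for a consensus round over the network (an assumption here).
    C = sum(G_i.T @ (G_i / lam_i[:, None])
            for G_i, lam_i in zip(G_blocks, lam_blocks)) / T
    # delta - (I + C)^{-1} C delta, which equals (I + C)^{-1} delta
    w = delta - np.linalg.solve(np.eye(T) + C, C @ delta)
    # d_i = -K_i G_i w with the hypothetical choice K_i = Lambda_i^{-1} / T
    return [-(G_i / lam_i[:, None]) @ w / T
            for G_i, lam_i in zip(G_blocks, lam_blocks)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, T, M = 6, 4, 3                      # parameters, samples, agents
    G = rng.normal(size=(n, T))
    delta = rng.normal(size=T)
    lam = rng.uniform(0.5, 2.0, size=n)

    d_central = centralized_step(G, delta, lam)
    d_local = distributed_step(np.split(G, M), delta, np.split(lam, M))
    print(np.allclose(d_central, np.concatenate(d_local)))
```

In this sketch the only quantity agents need to agree on is the $T \times T$ matrix `C` (together with the shared TD-error vector `delta`), so the size of the exchanged data depends on the batch length $T$ rather than on the number of agents.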

Abstract

The state of the art for model predictive control (MPC)-based distributed Q-learning is limited to first-order gradient updates of the MPC parameterization. In general, using second-order information can significantly improve the speed of convergence for learning, allowing the use of higher learning rates without introducing instability. This work presents a second-order extension to MPC-based Q-learning with updates distributed across local agents, relying only on locally available information and neighbor-to-neighbor communication. In simulation, the approach is demonstrated to significantly outperform first-order distributed Q-learning.

Paper Structure

This paper contains 10 sections, 49 equations, 2 figures.

Figures (2)

  • Figure 1: Moving average (100 steps) of TD error (top) and global stage cost (bottom, log scale). Five training instances are shown, with solid lines the median and shaded areas the interval between the 32nd and 68th percentiles.
  • Figure 2: State and action trajectories of agents during a learning instance.

Theorems & Definitions (3)

  • Remark 1
  • Remark 2
  • Remark 3