Scalable spectral representations for multi-agent reinforcement learning in network MDPs

Zhaolin Ren, Runyu Zhang, Bo Dai, Na Li

TL;DR

This work derives scalable spectral local representations for network MDPs, which induce a network linear subspace for the local $Q$-function of each agent. Building on these representations, it designs a scalable algorithmic framework for continuous state-action network MDPs and provides end-to-end convergence guarantees for the algorithm.

Abstract

Network Markov Decision Processes (MDPs), a popular model for multi-agent control, pose a significant challenge to efficient learning because the global state-action space grows exponentially with the number of agents. In this work, utilizing the exponential decay property of network dynamics, we first derive scalable spectral local representations for network MDPs, which induce a network linear subspace for the local $Q$-function of each agent. Building on these local spectral representations, we design a scalable algorithmic framework for continuous state-action network MDPs, and provide end-to-end guarantees for the convergence of our algorithm. Empirically, we validate the effectiveness of our scalable representation-based approach on two benchmark problems, and demonstrate its advantages over generic function approximation approaches to representing the local $Q$-functions.

Paper Structure

This paper contains 27 sections, 20 theorems, 124 equations, 4 figures, 1 algorithm.

Key Result

Lemma 1

Suppose the probability transition $P(s' \mid s,a)$ of the next state $s'$ given the current $(s,a)$ pair can be linearly decomposed as $P(s' \mid s,a) = \phi(s,a)^\top \mu(s')$ for some features $\phi(s,a) \in \mathbb{R}^D$ and $\mu(s') \in \mathbb{R}^D$, which we also refer to as spectral representations ...
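The decomposition assumed in Lemma 1 can be illustrated numerically on a small finite MDP. The sketch below is not the paper's algorithm and does not use network structure. It only checks the standard consequence of a linear transition model: for any fixed policy, $Q^\pi(s,a) - r(s,a)$ is linear in $\phi(s,a)$. The sizes, the random Dirichlet features `phi` and `mu`, the reward `r`, the policy `pi`, and the discount `gamma` are all hypothetical choices made for illustration, and the reward is kept as a separate term rather than folded into the features.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, D, gamma = 6, 3, 2, 0.9  # hypothetical sizes and discount

# Assumed features: phi(s, a) is a probability vector over D latent factors,
# and each mu_d is a probability distribution over next states.
phi = rng.dirichlet(np.ones(D), size=(n_states, n_actions))  # shape (S, A, D)
mu = rng.dirichlet(np.ones(n_states), size=D)                # shape (D, S)

# Linear decomposition P(s' | s, a) = phi(s, a)^T mu(s').
P = phi @ mu                                                 # shape (S, A, S)

# Arbitrary reward and stochastic policy (illustrative only).
r = rng.uniform(size=(n_states, n_actions))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)        # shape (S, A)

# Exact policy evaluation: V = (I - gamma * P_pi)^{-1} r_pi.
P_pi = np.einsum("sa,sat->st", pi, P)
r_pi = (pi * r).sum(axis=1)
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Q from the Bellman equation, using the full transition tensor.
Q = r + gamma * (P @ V)

# Q from the spectral form: one D-dimensional weight vector serves every (s, a).
w = gamma * (mu @ V)
Q_linear = r + phi @ w

print(np.allclose(Q, Q_linear))
```

The weight vector `w` has only `D` entries regardless of the number of states, which is the sense in which the features $\phi$ induce a linear subspace for the $Q$-function.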

Figures (4)

  • Figure 1: Learning trajectories of cost (lower is better) using Algorithm 1 with random features and with NN critics on a 50-dimensional stochastic linear dynamical system for varying $\kappa_\pi$. Curves show the average over 5 seeds, with bands of 1 standard deviation.
  • Figure 2: Change in reward during training for Kuramoto oscillator control, $n = 40$, $\kappa_\pi = 1, \kappa = 2$. The performance for each algorithm is averaged over 5 seeds.
  • Figure 3: Synchronization of frequency ($\dot\theta$) under SAC and Spectral + SAC controller, for 1600 time steps on a single trajectory. Each curve represents a different agent.
  • Figure 4: Change in reward during training for Kuramoto oscillator control, $n = 40$, $\kappa_\pi = 1, \kappa = 2$. In this experiment, the dynamics model is known. The performance for each algorithm is averaged over 5 seeds.

Theorems & Definitions (24)

  • Example 1: Kuramoto oscillator synchronization
  • Definition 1
  • Lemma 1: Representing local $Q_i$-value functions via spectral decomposition of $P$ (cf. Jin et al., 2020)
  • Lemma 2
  • Remark 1
  • Lemma 3: Local $Q_i$ approximation via network $\kappa$-local spectral features
  • Lemma 4
  • Lemma 5
  • Lemma 6: Policy Evaluation Error
  • Theorem 1
  • ...and 14 more