Optimally Solving Simultaneous-Move Dec-POMDPs: The Sequential Central Planning Approach

Johan Peralez, Aurèlien Delage, Jacopo Castellini, Rafael F. Cunha, Jilles S. Dibangoye

TL;DR

The paper addresses scalability in solving decentralized partially observable Markov decision processes (Dec-POMDPs) by replacing conventional simultaneous-move central planning with sequential-move central planning for decentralized execution. It proves an exact equivalence between simultaneous- and sequential-move Dec-POMDPs, introduces sequential occupancy statistics (SOC) and the soMDP to obtain piecewise-linear and convex (PWLC) value functions, and shows that the complexity of the backup operators drops from double exponential to polynomial at the expense of a longer planning horizon. This makes it possible to use single-agent methods such as SARSA while preserving convergence guarantees. The oSARSA algorithm, adapted to the sequential framework, outperforms state-of-the-art ε-optimal simultaneous-move solvers on two-agent and many-agent domains, and scales better to longer planning horizons and larger teams. The approach opens the door to efficient planning and reinforcement learning methods for multi-agent systems.

Abstract

The centralized training for decentralized execution paradigm emerged as the state-of-the-art approach to $ε$-optimally solving decentralized partially observable Markov decision processes. However, scalability remains a significant issue. This paper presents a novel and more scalable alternative, namely sequential-move centralized training for decentralized execution. This paradigm further pushes the applicability of Bellman's principle of optimality, yielding three new properties. First, it allows a central planner to reason upon sufficient sequential-move statistics instead of prior simultaneous-move ones. Next, it proves that $ε$-optimal value functions are piecewise linear and convex in such sufficient sequential-move statistics. Finally, it drops the complexity of the backup operators from double exponential to polynomial at the expense of longer planning horizons. Besides, it makes it easy to use single-agent methods, e.g., the SARSA algorithm enhanced with these findings, while still preserving convergence guarantees. Experiments on two- as well as many-agent domains from the literature against $ε$-optimal simultaneous-move solvers confirm the superiority of our novel approach. This paradigm opens the door for efficient planning and reinforcement learning methods for multi-agent systems.
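
The trade-off named in the abstract (cheaper backups, longer horizon) can be illustrated with a toy count. The sketch below is not the paper's backup operator: it only compares how many options a planner faces when all agents choose at once versus one agent at a time. The function names are hypothetical, and it assumes that each simultaneous step is unrolled into one sub-step per agent, which is an assumption made for this illustration.

```python
from math import prod


def simultaneous_choices(action_counts):
    """Number of joint actions when every agent chooses at the same step."""
    return prod(action_counts)


def sequential_choices(action_counts):
    """Total individual actions considered when agents choose one at a time."""
    return sum(action_counts)


def sequential_horizon(horizon, n_agents):
    """Horizon after unrolling each simultaneous step into one sub-step per
    agent (an assumption of this sketch)."""
    return horizon * n_agents


if __name__ == "__main__":
    for n in (2, 4, 8):
        counts = [3] * n  # hypothetical: every agent has 3 actions
        print(
            f"n={n}: joint={simultaneous_choices(counts)}, "
            f"sequential={sequential_choices(counts)}, "
            f"horizon 10 -> {sequential_horizon(10, n)}"
        )
```

In this toy setting the joint count grows as a product over agents while the sequential count grows as a sum, and the price is a horizon multiplied by the number of agents.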

Paper Structure

This paper contains 52 sections, 25 theorems, 57 equations, 5 figures, 5 tables, and 3 algorithms.

Key Result

Lemma 1

[Proof in Appendix C.1] Let $\sigma\colon \mathbb{N}^*_{\leq n} \mapsto \mathbb{N}^*_{\leq n}$ be a permutation over agents. Bellman's optimality equations (eq:bellman) can be re-written in sequential form with no loss in optimality, i.e., for all $t\in \mathbb{N}_{\leq \ell}$ and any $t$-th simult
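
The intuition behind rewriting a joint maximization sequentially can be checked on a toy problem. The sketch below is not the lemma itself: it ignores partial observability and decision rules, and simply verifies that a maximum over joint actions equals a nested maximum taken one agent at a time, for every agent ordering. The value function `toy_value` and the action sets are hypothetical.

```python
from itertools import permutations, product


def joint_max(q, action_sets):
    """Maximize q over all joint actions at once."""
    return max(q(joint) for joint in product(*action_sets))


def sequential_max(q, action_sets, order):
    """Maximize q by letting agents pick one at a time, following `order`."""
    n = len(action_sets)

    def recurse(k, chosen):
        if k == n:
            # Every agent has picked: evaluate the assembled joint action.
            return q(tuple(chosen[i] for i in range(n)))
        agent = order[k]
        return max(
            recurse(k + 1, {**chosen, agent: a}) for a in action_sets[agent]
        )

    return recurse(0, {})


def toy_value(joint):
    """Arbitrary deterministic payoff over three agents' actions."""
    a0, a1, a2 = joint
    return (7 * a0 + 3 * a1 * a2 - 5 * a2 + 2 * a0 * a1) % 11


if __name__ == "__main__":
    sets = [range(2), range(3), range(2)]
    best = joint_max(toy_value, sets)
    for order in permutations(range(len(sets))):
        assert sequential_max(toy_value, sets, order) == best
    print("all orderings agree with the joint maximum:", best)
```

The nested form evaluates the same joint actions here. Any computational savings in the paper come from its sequential statistics and backup operators, which this sketch does not model.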

Figures (5)

  • Figure 1: Influence diagram for a $2$-agent sequential-move Markov decision process.
  • Figure 2: Average iteration time for two-agent mars and grid3x3 problems with different planning horizons. Note that for the larger planning horizon ($\ell=100$) on grid3x3, $o$SARSA$^\mathtt{sim}$ is unable to complete even one iteration within the one-hour period.
  • Figure 3: Average iteration time for many-agent recycling and tiger problems with different team sizes.
  • Figure 4: Anytime values for the two-agent mars problem with planning horizons $\ell=20$ and $\ell=40$.
  • Figure 5: Anytime values for many-agent recycling problem with $n=2$ and $n=4$ agents.

Theorems & Definitions (47)

  • Definition 1
  • Definition 2
  • Lemma 1
  • Definition 3
  • Theorem 1
  • Definition 4
  • Definition 5
  • Lemma 2
  • Definition 6
  • Lemma 3
  • ...and 37 more