The price of decentralization in managing engineering systems through multi-agent reinforcement learning

Prateek Bhustali; Pablo G. Morato; Konstantinos G. Papakonstantinou; Charalampos P. Andriotis

Abstract

Inspection and maintenance (I&M) planning involves sequential decision making under uncertainties and incomplete information, and can be modeled as a partially observable Markov decision process (POMDP). While single-agent deep reinforcement learning provides approximate solutions to POMDPs, it does not scale well in multi-component systems. Scalability can be achieved through multi-agent deep reinforcement learning (MADRL), which decentralizes decision-making across multiple agents, locally controlling individual components. However, this decentralization can induce cooperation pathologies that degrade the optimality of the learned policies. To examine these effects in I&M planning, we introduce a set of deteriorating systems in which redundancy is varied systematically. These benchmark environments are designed such that computation of centralized (near-)optimal policies remains tractable, enabling direct comparison of solution methods. We implement and benchmark a broad set of MADRL algorithms spanning fully centralized and decentralized training paradigms, from value-factorization to actor-critic methods. Our results show a clear effect of redundancy on coordination: MADRL algorithms achieve near-optimal performance in series-like settings, whereas increasing redundancy amplifies coordination challenges and can lead to optimality losses. Nonetheless, decentralized agents learn structured policies that consistently outperform optimized heuristic baselines, highlighting both the promise and current limitations of decentralized learning for scalable maintenance planning.

Paper Structure (45 sections, 1 theorem, 31 equations, 14 figures, 5 tables)

This paper contains 45 sections, 1 theorem, 31 equations, 14 figures, 5 tables.

Introduction
Related work
Background: Formulation and solution frameworks
Partially Observable Markov Decision Process (POMDP)
Point-based POMDP solvers
Decentralized Partially Observable Markov Decision Process (Dec-POMDP)
Multi-agent POMDP (MPOMDP)
Multi-agent Deep Reinforcement Learning (MADRL)
Paradigms and algorithms
Centralized Training Centralized Execution (CTCE)
Centralized Training Decentralized Execution (CTDE)
Decentralized Training Decentralized Execution (DTDE)
Multi-agent learning pathologies
Climb Game
Multi-agent I&M environments
...and 30 more sections

Key Result

Proposition 1

Consider a two-agent, single-step matrix game with agents $m\in\{1,2\}$ and action sets $A_m=\{\mathtt{DN},\mathtt{Rep}.\}$, where $\mathtt{DN}$ denotes do nothing and $\mathtt{Rep}.$ denotes repair. Let $r: A_1\times A_2 \to \mathbb{R}$ be the joint reward function, and let $c_1,c_2>0$ denote the r (a) Under linear value decomposition (VDN), the induced greedy joint action is $(\mathtt{Rep}.,\ma

Figures (14)

Figure 1: In multi-component systems, joint state, action, and observation spaces grow exponentially with the number of components, rendering single-agent approaches intractable. Dec-POMDPs address this curse of dimensionality by decentralizing control and are commonly solved using multi-agent DRL methods. Left: Scalability of solution approaches. The curved boundaries illustrate the approximate, relative scalability of POMDP and Dec-POMDP solution approaches as these spaces increase. The marker indicates the approximate size of the systems studied in this work, where single-agent methods remain tractable and allow direct comparison with multi-agent approaches. Right: Agent-centric paradigms. Single-agent approaches are intrinsically centralized because a single agent possesses all information during training and execution. In contrast, multi-agent approaches, such as CTDE, relax decentralization during training by allowing agents to share information, but agents must learn to execute actions independently at execution/inference.
Figure 2: A Dec-POMDP unrolled as a dynamic decision network over three time steps. At each time step $t$, the environment state $s^t$ evolves according to the transition model $s^{t+1} \sim T(\cdot \mid s^t, \mathbf{a}^t)$ under the joint action $\mathbf{a}^t$, yielding a global reward $r^t$ and individual observations $\mathbf{o}^{t+1} \sim O(\cdot \mid s^{t+1},\mathbf{a}^t)$ for each agent $m \in \mathcal{M}$. Each agent selects its action based on its local action–observation history $h_m^{t}$.
Figure 3: An overview of MADRL algorithms for managing a (four-component) reliability system. Algorithms are classified by training–execution paradigms: centralized training centralized execution (CTCE), centralized training decentralized execution (CTDE), and decentralized training decentralized execution (DTDE), reflecting increasing decentralization and information constraints. Bottom labels indicate the underlying formulation framework (POMDP, MPOMDP, Dec-POMDP). Suffix PS denotes parameter sharing across agents.
Figure 4: Left: Payoff matrix for the Climb Game with two agents, each choosing from three actions. The joint action $(\mathfrak{a}^1, \mathfrak{a}^1)$ yields the optimal reward of 11 [claus_dynamics_1998]. Right: Best performance of MADRL algorithms aggregated over 30 training instances in the Climb Game played repeatedly for 25 time steps [papoudakis_benchmarking_2021]. The optimal return is $11\times25 = 275$.
Figure 5: Box plots summarize algorithm performance across $k$-out-of-4 systems, aggregated over 10 training instances. The results are normalized using the expected cost of the (near-)optimal SARSOP policy (Table \ref{tab:best_results_kn_infinitehorizon}), allowing comparison across varying redundancy levels. Most decentralized algorithms perform near-optimally in series configurations $(k=n)$, but deviate from the (near-)optimal SARSOP baseline as redundancy increases, with the largest performance drop observed in the parallel configurations $(k=1)$.
...and 9 more figures
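
The redundancy levels in Figure 5 can be made concrete with a small sketch. It assumes the standard reliability definition in which a $k$-out-of-$n$ system functions when at least $k$ of its $n$ components function, so that $k=n$ is a series system and $k=1$ a parallel one. The independent component failure probabilities and the function names are hypothetical and serve only as illustration; they are not the deterioration models used in the paper.

```python
from itertools import product


def system_functions(component_ok, k):
    """A k-out-of-n system functions if at least k components function."""
    return sum(component_ok) >= k


def system_failure_probability(p_fail, k):
    """Exact system failure probability by enumerating component states.

    Assumes independent components; p_fail[i] is the (hypothetical)
    failure probability of component i.
    """
    total = 0.0
    for states in product([True, False], repeat=len(p_fail)):
        prob = 1.0
        for ok, p in zip(states, p_fail):
            prob *= (1.0 - p) if ok else p
        if not system_functions(states, k):
            total += prob
    return total


if __name__ == "__main__":
    p_fail = [0.1, 0.1, 0.1, 0.1]  # illustrative values only
    for k in range(1, 5):
        print(k, system_failure_probability(p_fail, k))
```

With identical components, the series case ($k=4$) fails whenever any single component fails, while the parallel case ($k=1$) fails only when all four do. This gap is why one agent's repair decision matters more or less depending on what the other agents do.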

Theorems & Definitions (2)

Proposition 1: Limitations of value factorization under redundancy
Proof 1
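
The Climb Game of Figure 4 gives a compact illustration of how an additive value factorization can select a suboptimal joint action. The sketch below is a static approximation, not the paper's training procedure: it fits $Q_1(a_1) + Q_2(a_2)$ to the payoff matrix by least squares with uniform weight on all joint actions, whereas actual VDN training depends on sampled experience and exploration. The payoff entries other than the optimal reward of 11 are assumed from the commonly cited form of the game and should be checked against Figure 4.

```python
# Assumed Climb Game payoffs (rows: agent 1, columns: agent 2).
# Only the optimal reward of 11 at the first joint action is confirmed
# by the caption of Figure 4; the other entries are an assumption.
PAYOFF = [
    [11, -30, 0],
    [-30, 7, 6],
    [0, 0, 5],
]


def additive_fit(payoff):
    """Uniform least-squares fit of payoff[i][j] by q1[i] + q2[j].

    For a complete matrix this reduces to row and column means,
    each shifted by half the grand mean.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    grand = sum(sum(row) for row in payoff) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in payoff]
    col_means = [sum(payoff[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    q1 = [m - grand / 2 for m in row_means]
    q2 = [m - grand / 2 for m in col_means]
    return q1, q2


def greedy_joint_action(q1, q2):
    """Each agent independently maximizes its own utility."""
    a1 = max(range(len(q1)), key=lambda i: q1[i])
    a2 = max(range(len(q2)), key=lambda j: q2[j])
    return a1, a2


if __name__ == "__main__":
    q1, q2 = additive_fit(PAYOFF)
    a1, a2 = greedy_joint_action(q1, q2)
    print("greedy joint action:", (a1, a2), "reward:", PAYOFF[a1][a2])
```

Under these assumptions, the large penalties for miscoordination pull down the averaged utility of each agent's first action, so the decentralized greedy choice settles on a safer joint action instead of the optimal one.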
