Capacity-Aware Planning and Scheduling in Budget-Constrained Multi-Agent MDPs: A Meta-RL Approach

Manav Vora, Ilan Shomorony, Melkior Ornik

TL;DR

This work addresses planning under concurrent capacity and budget constraints in large-scale multi-agent MDPs by introducing a two-stage approach: first, partition the agents into $r$ groups via a Linear Sum Assignment Problem (LSAP) that maximizes diversity in time-to-failure, producing $r$ independent sub-MDPs; second, train a single meta-PPO policy that rapidly adapts to each sub-MDP and can be deployed in parallel. The method converts an intractable joint-action problem into tractable subproblems and promotes policy transfer across groups, resulting in near-linear scalability in the number of agents. Empirical results on a robot-maintenance benchmark with up to $n=1000$ agents show that LSAP+Meta-PPO outperforms ILP, MILP, and several RL baselines in average uptime and budget pacing, with complexity analyses confirming scalability. The proposed framework offers a practical path to scalable, budget-aware maintenance and scheduling in large fleets, reducing overtime and improving system uptime in real-world facilities.

Abstract

We study capacity- and budget-constrained multi-agent MDPs (CB-MA-MDPs), a class that captures many maintenance and scheduling tasks in which each agent can irreversibly fail and a planner must decide (i) when to apply a restorative action and (ii) which subset of agents to treat in parallel. The global budget limits the total number of restorations, while the capacity constraint bounds the number of simultaneous actions, turning naïve dynamic programming into a combinatorial search that scales exponentially with the number of agents. We propose a two-stage solution that remains tractable for large systems. First, a Linear Sum Assignment Problem (LSAP)-based grouping partitions the agents into r disjoint sets (r = capacity) that maximise diversity in expected time-to-failure, allocating budget to each set proportionally. Second, a meta-trained PPO policy solves each sub-MDP, leveraging transfer across groups to converge rapidly. To validate our approach, we apply it to the problem of scheduling repairs for a large team of industrial robots, constrained by a limited number of repair technicians and a total repair budget. Our results demonstrate that the proposed method outperforms baseline approaches in terms of maximizing the average uptime of the robot team, particularly for large team sizes. Lastly, we confirm the scalability of our approach through a computational complexity analysis across varying numbers of robots and repair technicians.
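
The grouping and budget-splitting stage described above can be illustrated with a short sketch. This is not the authors' formulation: the abstract does not give the LSAP cost matrix or the quantity that the budget is proportional to, so the code makes two labeled assumptions. First, agents are dealt out in batches of $r$ (sorted by expected time-to-failure), and within each batch an LSAP assigns one agent per group so that each agent lands far from the time-to-failure values already in its group. Second, the budget is split in proportion to group size. The function names `solve_lsap`, `group_agents`, and `split_budget` are hypothetical, and the LSAP is solved by brute-force enumeration, which is only viable for very small $r$.

```python
from itertools import permutations


def solve_lsap(gain):
    """Exact LSAP by enumeration: map each row to a distinct column,
    maximizing the total gain. Only viable for tiny matrices; a
    Hungarian-method solver would take its place at realistic sizes."""
    rows, cols = len(gain), len(gain[0])
    return max(
        permutations(range(cols), rows),
        key=lambda choice: sum(gain[i][c] for i, c in enumerate(choice)),
    )


def group_agents(ttf, r):
    """Partition agent indices into r groups.
    ASSUMPTION: the batching scheme and the gain definition are
    illustrative choices, not taken from the paper."""
    order = sorted(range(len(ttf)), key=lambda a: ttf[a])
    groups = [[] for _ in range(r)]
    for start in range(0, len(order), r):
        batch = order[start:start + r]
        # gain[i][g]: distance from batch agent i to the closest
        # time-to-failure already in group g (0.0 if the group is empty).
        gain = [
            [min((abs(ttf[a] - ttf[b]) for b in groups[g]), default=0.0)
             for g in range(r)]
            for a in batch
        ]
        for a, g in zip(batch, solve_lsap(gain)):
            groups[g].append(a)
    return groups


def split_budget(total_budget, groups):
    """Integer budget split proportional to group size (ASSUMPTION),
    using the largest-remainder rule so the shares sum to the total."""
    n = sum(len(g) for g in groups)
    shares = [total_budget * len(g) // n for g in groups]
    remainders = [total_budget * len(g) % n for g in groups]
    leftover = total_budget - sum(shares)
    ranked = sorted(range(len(groups)), key=lambda g: remainders[g], reverse=True)
    for g in ranked[:leftover]:
        shares[g] += 1
    return shares


# Seven agents with made-up expected times-to-failure, capacity r = 3.
ttf = [5.0, 50.0, 7.0, 48.0, 6.0, 52.0, 30.0]
groups = group_agents(ttf, r=3)
budgets = split_budget(10, groups)
```

Each group then defines one sub-MDP with its own budget share, which is where the meta-trained PPO policy would be applied. At realistic values of $r$, the enumeration in `solve_lsap` would need to be replaced by a polynomial-time assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment`.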

Paper Structure

This paper contains 34 sections, 1 theorem, 21 equations, 5 figures, 3 tables.

Key Result

Theorem 1

Let $X \in \mathbb{R}^{nk \times nk}$ be a symmetric random matrix where the entries $\{ X_{ij} \mid i \leq j \}$ are independently and identically distributed (i.i.d.) standard normal random variables, i.e., $X_{ij} \sim \mathcal{N}(0,1)$ for $i \leq j$, and $X_{ji} = X_{ij}$ for $i > j$. Let $\mat

Figures (5)

  • Figure 1: Architectural overview of the proposed approach.
  • Figure 2: Distribution of episode survival time $T_{\mathrm{abs}}$ for $(n,r)=(100,30)$ across methods (100 runs each). Boxes show the interquartile range; whiskers span the 5th to 95th percentiles; markers denote means and circles denote outliers.
  • Figure 3: Budget sensitivity for $(n,r)=(100,30)$: mean survival time vs. total budget $B$. Shaded bands are $\pm \sigma$ over 100 runs; ILP/MILP not shown (timeout).
  • Figure 4: Log-log plot showing the computational complexity of the proposed approach for varying numbers of robots. $(n,r)$ values are the same as in Table \ref{tab:stepwise_complexity}.
  • Figure 5: Heatmap showing the time taken (in seconds) for the proposed approach, across different numbers of robots and repair technicians. Values are plotted on a logarithmic scale to better capture the variation.

Theorems & Definitions (2)

  • Theorem 1
  • proof