Capacity-Aware Planning and Scheduling in Budget-Constrained Multi-Agent MDPs: A Meta-RL Approach
Manav Vora, Ilan Shomorony, Melkior Ornik
TL;DR
This work addresses planning under concurrent capacity and budget constraints in large-scale multi-agent MDPs by introducing a two-stage approach: first, partition the agents into $r$ groups via a Linear Sum Assignment Problem (LSAP) that maximizes diversity in time-to-failure, producing $r$ independent sub-MDPs; second, train a single meta-PPO policy that rapidly adapts to each sub-MDP and can be deployed in parallel. The method converts an intractable joint-action problem into tractable subproblems and promotes policy transfer across groups, resulting in near-linear scalability in the number of agents. Empirical results on a robot-maintenance benchmark with up to $n=1000$ agents show that LSAP+Meta-PPO outperforms ILP, MILP, and several RL baselines in average uptime and budget pacing, with complexity analyses confirming scalability. The proposed framework offers a practical path to scalable, budget-aware maintenance and scheduling in large fleets, reducing overtime and improving system uptime in real-world facilities.
Abstract
We study capacity- and budget-constrained multi-agent MDPs (CB-MA-MDPs), a class that captures many maintenance and scheduling tasks in which each agent can irreversibly fail and a planner must decide (i) when to apply a restorative action and (ii) which subset of agents to treat in parallel. The global budget limits the total number of restorations, while the capacity constraint bounds the number of simultaneous actions, turning naïve dynamic programming into a combinatorial search that scales exponentially with the number of agents. We propose a two-stage solution that remains tractable for large systems. First, a Linear Sum Assignment Problem (LSAP)-based grouping partitions the agents into $r$ disjoint sets, where $r$ equals the capacity, that maximize diversity in expected time-to-failure, allocating budget to each set proportionally. Second, a meta-trained PPO policy solves each sub-MDP, leveraging transfer across groups to converge rapidly. To validate our approach, we apply it to the problem of scheduling repairs for a large team of industrial robots, constrained by a limited number of repair technicians and a total repair budget. Our results demonstrate that the proposed method outperforms baseline approaches in terms of maximizing the average uptime of the robot team, particularly for large team sizes. Lastly, we confirm the scalability of our approach through a computational complexity analysis across varying numbers of robots and repair technicians.
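
To give a rough sense of the combinatorial growth described above, the following Python sketch counts the joint actions available at a single decision step when at most $r$ of $n$ agents can be restored at once, and compares that with the per-group action sets after the agents are split into $r$ groups. The function names are illustrative, and the per-group count rests on an assumption that is not stated in the abstract: each group restores at most one agent per step and the groups are of near-equal size.

```python
from math import comb


def joint_action_count(n: int, r: int) -> int:
    """Number of ways to pick at most r of n agents to restore in one step."""
    return sum(comb(n, k) for k in range(r + 1))


def per_group_action_counts(n: int, r: int) -> list:
    """Action-set sizes after splitting n agents into r near-equal groups.

    Assumption (not from the source): each group restores at most one
    agent per step, so a group of size m has m + 1 actions
    (m choices of agent plus "do nothing").
    """
    base, extra = divmod(n, r)
    # The first `extra` groups absorb the remainder, one agent each.
    sizes = [base + 1 if i < extra else base for i in range(r)]
    return [m + 1 for m in sizes]


if __name__ == "__main__":
    n, r = 1000, 10
    print(f"joint actions per step: {joint_action_count(n, r):.3e}")
    print(f"largest per-group action set: {max(per_group_action_counts(n, r))}")
```

For $n=1000$ and $r=10$, the joint count exceeds $10^{23}$, whereas under the stated assumption each group chooses among 101 actions. The sketch counts actions at a single step only; it does not model the state space, the budget, or the LSAP grouping itself.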
