Multi-Agent Reinforcement Learning with Hierarchical Coordination for Emergency Responder Stationing

Amutheezan Sivagnanam; Ava Pettet; Hunter Lee; Ayan Mukhopadhyay; Abhishek Dubey; Aron Laszka

TL;DR

This work reframes proactive emergency responder stationing as a hierarchical, multi-agent reinforcement learning problem and replaces computationally expensive online search with fast, learned policies. It couples region-specific low-level agents (LLPs) that use Transformer-XL-based actors with a high-level agent that reallocates responders across regions. Continuous actions are mapped to discrete allocations via minimum-cost flow at the high level and maximum-weight matching at the low level, which makes real-time decision-making possible. Rewards at the high level are estimated using the critics of the low-level agents to stabilize optimization across asynchronous regions. Empirical evaluation on Nashville and Seattle data shows that, compared to the state of the art, the method reduces computation time per decision by three orders of magnitude and slightly reduces average response time (by 5 seconds), illustrating practical impact for time-critical ERM systems.

Abstract

An emergency responder management (ERM) system dispatches responders, such as ambulances, when it receives requests for medical aid. ERM systems can also proactively reposition responders between predesignated waiting locations to cover any gaps that arise due to the prior dispatch of responders or significant changes in the distribution of anticipated requests. Optimal repositioning is computationally challenging due to the exponential number of ways to allocate responders between locations and the uncertainty in future requests. The state-of-the-art approach in proactive repositioning is a hierarchical approach based on spatial decomposition and online Monte Carlo tree search, which may require minutes of computation for each decision in a domain where seconds can save lives. We address the issue of long decision times by introducing a novel reinforcement learning (RL) approach, based on the same hierarchical decomposition, but replacing online search with learning. To address the computational challenges posed by large, variable-dimensional, and discrete state and action spaces, we propose: (1) actor-critic based agents that incorporate transformers to handle variable-dimensional states and actions, (2) projections to fixed-dimensional observations to handle complex states, and (3) combinatorial techniques to map continuous actions to discrete allocations. We evaluate our approach using real-world data from two U.S. cities, Nashville, TN and Seattle, WA. Our experiments show that compared to the state of the art, our approach reduces computation time per decision by three orders of magnitude, while also slightly reducing average ambulance response time by 5 seconds.
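To make item (3) concrete, the following is a minimal sketch of how a maximum-weight matching can turn continuous scores into a discrete allocation of responders to depots within one region. It assumes the continuous action has already been converted into one weight per responder-depot pair; the `weights` matrix, the function name, and the example values are hypothetical and not taken from the paper. For brevity the sketch enumerates all assignments exhaustively, which is only practical for small regions; a polynomial-time assignment solver would normally replace it.

```python
from itertools import permutations


def max_weight_matching(weights):
    """Return (best_total, assignment) for a small bipartite matching.

    weights[i][j] is the (hypothetical) weight of placing responder i at
    depot j. assignment[i] is the depot index chosen for responder i.
    Each depot receives at most one responder in this sketch.
    """
    n_responders = len(weights)
    n_depots = len(weights[0])
    if n_responders > n_depots:
        raise ValueError("need at least as many depots as responders")

    best_total, best_assignment = float("-inf"), None
    # Exhaustive search: try every ordered choice of distinct depots.
    for depots in permutations(range(n_depots), n_responders):
        total = sum(weights[i][d] for i, d in enumerate(depots))
        if total > best_total:
            best_total, best_assignment = total, list(depots)
    return best_total, best_assignment


# Two responders, three depots (illustrative numbers only).
weights = [
    [0.90, 0.80, 0.10],  # responder 0
    [0.85, 0.20, 0.30],  # responder 1
]
total, assignment = max_weight_matching(weights)
print(assignment)  # responder 0 -> depot 1, responder 1 -> depot 0
```

In this example a greedy choice would place responder 0 at depot 0 because 0.90 is the largest single weight, leaving responder 1 with a poor option. The matching instead optimizes the sum over all responders, which is why the two responders end up at depots 1 and 0.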

Paper Structure (62 sections, 17 equations, 13 figures, 5 tables, 1 algorithm)

Introduction
Problem Formulation
Model
State
Transition
Action
Reward
Hierarchical Decision Framework
Solution Approach
Low-Level Decision Agent: Reallocating Responders within a Region
Actor Input
Actor Network
Discrete Action
Critic
High-Level Decision Agent: Reallocating Responders between Regions
...and 47 more sections

Figures (13)

Figure 1: High-level overview of the state-of-the-art hierarchical framework (Pettet et al., 2021), described in the Hierarchical Decision Framework section.
Figure 2: Overview of the low-level RL agent training process using DDPG for a region $\textit{g} \in \mathcal{G}$. First, we map the complex, variable-dimensional state ($s^\textit{g}_{\textit{t}}$) to a sequence of feature vectors, which we feed to the actor to obtain a continuous action ($\textbf{a}_{\textit{t}}^{\textit{g}}$). Next, we discretize the continuous action using maximum-weight matching to allocate responders within the region $\textit{g}$. Finally, we use the critic to judge the performance of the actor: the critic receives the state and action as fixed-size vectors and is trained against the response time for serving each incident.
Figure 3: Overview of the training process of the high-level RL agent using DDPG. First, we map the state to a fixed-size feature vector and feed it into the MLP-based actor to generate the continuous action ($\textbf{a}_{\textit{t}}^{\textit{h}}$). Next, we discretize the continuous action and solve a minimum-cost flow problem to generate the assignment of responders to depots ($\mathcal{A}$). After that, we trigger the LLPs whose regions were affected by the high-level reallocation. Finally, we use the critic to judge the performance of the actor, training the critic with rewards estimated by the LLP critics.
Figure 4: Distribution of average response times (lower is better) with our approach, MCTS, $p$-median with $\alpha = 1.0$, greedy policy, and static policy (i.e., no proactive repositioning) for 10 different sample incident chains with (a) 24 responders, (b) 26 responders, and (c) 28 responders. (d) Distribution of average response times using MCTS and various architectures as the actor for the low-level agent (TrXL, GTrXL, and LSTM), trained and evaluated with the HLP from prior work (Pettet et al., 2021), for 10 different sample incident chains with 26 responders (Nashville).
Figure 5: Distribution of average response times (lower is better) with our approach, MCTS, and DRLSN for 10 different sample incident chains (Nashville). In this figure, we plot the same data for our approach and MCTS as in Figure 4(a)-(c); the only difference is the inclusion of DRLSN, which changes the scaling of the vertical axis.
...and 8 more figures
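
The minimum-cost flow step mentioned in the Figure 3 caption can be illustrated with a small sketch. It assumes the high-level action has already been discretized into a target number of responders per region, and it finds the cheapest set of inter-region moves that reaches those targets. The region names, the `travel_cost` table, and the function names are hypothetical; the solver is a plain successive-shortest-path routine written for this illustration, not the paper's implementation.

```python
from collections import deque


def min_cost_flow(n, edges, source, sink, required):
    """Successive shortest paths; edges are (u, v, capacity, cost) tuples.

    Returns (total_cost, flows), where flows[i] is the flow on edges[i].
    """
    graph = [[] for _ in range(n)]
    to, cap, cost = [], [], []
    for u, v, c, w in edges:  # forward edge at index 2i, reverse at 2i + 1
        graph[u].append(len(to)); to.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-w)
    sent, total = 0, 0
    while sent < required:
        # Queue-based Bellman-Ford on the residual graph.
        dist = [float("inf")] * n
        prev_edge = [-1] * n
        queued = [False] * n
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            queued[u] = False
            for e in graph[u]:
                v = to[e]
                if cap[e] > 0 and dist[u] + cost[e] < dist[v]:
                    dist[v], prev_edge[v] = dist[u] + cost[e], e
                    if not queued[v]:
                        queued[v] = True
                        queue.append(v)
        if prev_edge[sink] == -1:
            raise ValueError("target allocation is infeasible")
        # Find the bottleneck along the path, then push flow along it.
        push, v = required - sent, sink
        while v != source:
            e = prev_edge[v]
            push, v = min(push, cap[e]), to[e ^ 1]
        v = sink
        while v != source:
            e = prev_edge[v]
            cap[e] -= push
            cap[e ^ 1] += push
            v = to[e ^ 1]
        sent += push
        total += push * dist[sink]
    return total, [cap[2 * i + 1] for i in range(len(edges))]


def rebalance(current, target, travel_cost):
    """Cheapest moves {(src, dst): count} turning current into target."""
    if sum(current.values()) != sum(target.values()):
        raise ValueError("current and target must hold the same total")
    surplus = {g: current[g] - target[g] for g in current if current[g] > target[g]}
    deficit = {g: target[g] - current[g] for g in current if target[g] > current[g]}
    node = {g: i + 1 for i, g in enumerate(current)}  # node 0 is the source
    sink = len(current) + 1
    edges = [(0, node[g], s, 0) for g, s in surplus.items()]
    edges += [(node[g], sink, d, 0) for g, d in deficit.items()]
    pairs = [(a, b) for a in surplus for b in deficit]
    first = len(edges)
    edges += [(node[a], node[b], surplus[a], travel_cost[a][b]) for a, b in pairs]
    total, flows = min_cost_flow(sink + 1, edges, 0, sink, sum(deficit.values()))
    return total, {p: f for p, f in zip(pairs, flows[first:]) if f > 0}


# Illustrative numbers only: four regions, eight responders.
current = {"A": 4, "B": 3, "C": 0, "D": 1}
target = {"A": 2, "B": 2, "C": 2, "D": 2}
travel_cost = {"A": {"C": 5, "D": 2}, "B": {"C": 3, "D": 4}}
total, moves = rebalance(current, target, travel_cost)
print(total, moves)  # 10 {('A', 'C'): 1, ('A', 'D'): 1, ('B', 'C'): 1}
```

Regions with more responders than their target act as supplies and regions with fewer act as demands, so the flow on each supply-demand edge is the number of responders to move between that pair of regions.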
