Factorized Deep Q-Network for Cooperative Multi-Agent Reinforcement Learning in Victim Tagging

Maria Ana Cardei, Afsaneh Doryab

TL;DR

This work addresses the challenge of minimizing victim tagging time in mass casualty incidents under uncertainty by formulating an integer linear programming (ILP) baseline and introducing five distributed heuristics that reflect varying communication capabilities. It then presents a Factorized Deep Q-Network (FDQN), a multi-agent reinforcement learning (MARL) approach with a shared global state and decentralized actions, augmented by action masking to enable cooperative victim tagging. In extensive simulations, local, uncertainty-aware heuristics consistently outperform global strategies, while FDQN outperforms the heuristics in smaller-scale scenarios but struggles as problem size grows, indicating complementary roles for learning and heuristics. Overall, the study provides actionable guidance for emergency response planning and highlights the potential and current limits of MARL in large-scale, real-time disaster response.
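To make the action-masking idea concrete, the following is a minimal sketch of masked greedy action selection with one Q-value vector per agent. It is an illustration under stated assumptions, not the paper's implementation: the function names (`masked_greedy_action`, `joint_greedy_actions`), the list-based Q-values, and the meaning of a "valid" action (for example, a victim not yet tagged) are all hypothetical.

```python
import math


def masked_greedy_action(q_values, valid_mask):
    """Return the index of the best-valued action among those marked valid.

    Hypothetical helper: q_values is one agent's list of Q-values and
    valid_mask is a same-length list of booleans (True = action allowed).
    """
    if len(q_values) != len(valid_mask):
        raise ValueError("q_values and valid_mask must have the same length")
    if not any(valid_mask):
        raise ValueError("no valid action available")
    # Invalid actions are replaced by -inf so they can never win the argmax.
    masked = [q if ok else -math.inf for q, ok in zip(q_values, valid_mask)]
    return max(range(len(masked)), key=masked.__getitem__)


def joint_greedy_actions(per_agent_q, per_agent_mask):
    """Pick one masked greedy action per agent (decentralized action choice)."""
    return [
        masked_greedy_action(q, mask)
        for q, mask in zip(per_agent_q, per_agent_mask)
    ]


if __name__ == "__main__":
    # Two agents, two candidate victims; victim 1 is masked out for agent 0.
    q = [[0.2, 0.9], [0.7, 0.1]]
    mask = [[True, False], [True, True]]
    print(joint_greedy_actions(q, mask))
```

In this sketch the mask only restricts which action each agent may pick; how the Q-values are produced from the shared global state is left out entirely.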

Abstract

Mass casualty incidents (MCIs) are a growing concern, characterized by complexity and uncertainty that demand adaptive decision-making strategies. The victim tagging step in the emergency medical response must be completed quickly and is crucial for providing information to guide subsequent time-constrained response actions. In this paper, we present a mathematical formulation of multi-agent victim tagging to minimize the time it takes for responders to tag all victims. Five distributed heuristics are formulated and evaluated with simulation experiments. The heuristics considered are on-the-go, practical solutions that represent varying levels of situational uncertainty in the form of global or local communication capabilities, showcasing practical constraints. We further investigate the performance of a multi-agent reinforcement learning (MARL) strategy, factorized deep Q-network (FDQN), to minimize victim tagging time as compared to baseline heuristics. Extensive simulations demonstrate that, among the heuristics, methods with local communication are more efficient for adaptive victim tagging, in particular choosing the nearest victim with the option to replan. Analyzing all experiments, we find that our FDQN approach outperforms heuristics in smaller-scale scenarios, while heuristics excel in more complex scenarios. Our experiments contain diverse complexities that explore the upper limits of MARL capabilities for real-world applications and reveal key insights.
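The "nearest victim with the option to replan" idea can be sketched as a selection rule that a responder re-evaluates whenever its information changes. This is a hedged illustration only: the abstract does not specify the distance metric, the tie-breaking rule, or what triggers replanning, so the 2D Euclidean distance, the lowest-index tie-break, and the function name `nearest_untagged_victim` are assumptions.

```python
import math


def nearest_untagged_victim(responder_pos, victim_positions, tagged):
    """Return the index of the closest victim not yet tagged, or None.

    Assumptions (not from the paper): positions are (x, y) tuples, distance
    is Euclidean, and ties go to the lowest victim index.
    """
    best_idx, best_dist = None, math.inf
    for idx, pos in enumerate(victim_positions):
        if idx in tagged:
            continue  # skip victims already known to be tagged
        dist = math.dist(responder_pos, pos)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


if __name__ == "__main__":
    victims = [(0, 0), (5, 5), (1, 1)]
    tagged = set()
    responder = (2, 2)

    target = nearest_untagged_victim(responder, victims, tagged)
    print("initial target:", target)

    # Replanning: another responder tags that victim first, so we re-select.
    tagged.add(target)
    print("replanned target:", nearest_untagged_victim(responder, victims, tagged))
```

Calling the function again after the set of tagged victims changes is what stands in for replanning here; a fuller simulation would also move the responder and limit which tags it can observe.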

Paper Structure

This paper contains 38 sections, 25 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Responder agent finite state machine (FSM). The state 'select $v$' indicates selecting a victim.
  • Figure 2: Factorized deep Q-network (FDQN) architecture.
  • Figure 3: Heuristic policy comparison for the number of victims tagged over each time step for 100 victims and (a) 20 or (b) 80 responders. Each color curve denotes a different policy, and the dotted vertical line shows the average time it takes to tag all victims for each policy.
  • Figure 4: Graphs show results evaluating heuristics. (I) Number of victims tagged over time for 5, 20, and 80 responders. (II) Average time all victims are tagged for up to 100 victims for 5, 20, and 80 responders. (III) Average time all victims are tagged for 10, 20, and 100 victims for up to 5 responders. Columns (a-e) illustrate different policies. (IV) Responder agents' states over time. Four responders are shown for experiment 1 (5R, 10V). Parts (f-h) demonstrate different policies and the time when all victims are tagged ($t_{all}$).
  • Figure 5: Training graphs for FDQN in experiments R1-R8. The graphs show loss, reward values, number of steps until all victims are tagged, and the computational time in seconds over each training episode. The darker line represents a 71-point simple moving average, smoothing the raw data. The lighter shaded region indicates a $\pm$1 standard deviation from the original (un-smoothed) data.
  • ...and 1 more figure