Emergency Preemption Without Online Exploration: A Decision Transformer Approach

Haoran Su, Hanxiao Deng, Yandong Sun

Abstract

Emergency vehicle (EV) response time is a critical determinant of survival outcomes, yet deployed signal preemption strategies remain reactive and uncontrollable. We propose a return-conditioned framework for emergency corridor optimization based on the Decision Transformer (DT). By casting corridor optimization as offline, return-conditioned sequence modeling, our approach (1) eliminates online environment interaction during policy learning, (2) enables dispatch-level urgency control through a single target-return scalar, and (3) extends to multi-agent settings via a Multi-Agent Decision Transformer (MADT) with graph attention for spatial coordination. On the LightSim simulator, DT reduces average EV travel time by 37.7% relative to fixed-timing preemption on a 4x4 grid (88.6 s vs. 142.3 s), achieving the lowest civilian delay (11.3 s/veh) and fewest EV stops (1.2) among all methods, including online RL baselines that require environment interaction. MADT improves further on larger grids, overtaking DT with a 45.2% reduction on the 8x8 grid via graph-attention coordination. Return conditioning produces a smooth dispatch interface: varying the target return from 100 to -400 trades EV travel time (72.4-138.2 s) against civilian delay (16.8-5.4 s/veh) without any retraining. A Constrained DT extension adds explicit civilian disruption budgets as a second control knob.
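
The headline 37.7% figure follows directly from the two travel times quoted in the abstract. As a quick arithmetic check, here is a minimal Python sketch; the function name `relative_reduction` is ours and does not come from the paper.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline


# Both travel times are taken from the abstract (4x4 grid).
fixed_timing_ett = 142.3  # s, fixed-timing preemption
dt_ett = 88.6             # s, Decision Transformer

reduction = relative_reduction(fixed_timing_ett, dt_ett)
print(f"{reduction:.1f}%")  # 37.7%
```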

Paper Structure (103 sections, 20 equations, 10 figures, 16 tables, 2 algorithms)

Figures (10)

  • Figure 1: Network topology of the $4{\times}4$ grid used in our primary experiments. Red nodes and thick red edges indicate a representative EV corridor ($K{=}7$ intersections). Each link spans 300 m, discretized into 4 CTM cells of length $\ell = v_f \cdot \Delta t = 75$ m. The EV originates at $v_4$ (northwest) and traverses to $v_{13}$ (southeast), encountering 7 signalized intersections along its route. Background intersections (gray) operate under the same signal controller but are not part of the EV corridor. Traffic enters and exits the network at all boundary nodes with configurable demand rates (default: 0.10 veh/s per entry point).
  • Figure 2: Architecture overview of the three proposed models. Left: The single-agent DT processes interleaved (return-to-go, state, action) tokens through a causal GPT-style transformer ($L{=}4$ layers, $N_H{=}4$ heads) to predict per-intersection phase logits. Right: MADT enriches state embeddings via a 2-layer GAT that aggregates neighbor information before the transformer, enabling decentralized inter-intersection coordination. CDT extends the token sequence with a cost-to-go token per timestep (Section \ref{sec:cdt}), providing a second dispatch control knob.
  • Figure 3: Training pipeline for the DT-based EV corridor optimizer. Data is collected in LightSim using a mixed-quality behavioral policy (70% expert, 15% random, 15% noisy), producing 5,000 episodes with broad return coverage. The offline dataset is tokenized into (return-to-go, state, action) sequences and used to train the DT via cross-entropy minimization for 100 epochs. At deployment, the trained model receives a dispatcher-specified target return $G^\star$ and autoregressively generates phase commands at 2.3 ms per step, well within the 5 s control cycle.
  • Figure 4: Performance comparison of nine methods on the $4{\times}4$ grid (100 evaluation episodes). DT (ours) achieves the lowest EV travel time (88.6 s), lowest civilian delay (11.3 s/veh), and fewest EV stops (1.2), all without online environment interaction. Error bars show $\pm$1 standard deviation. Numerical values are in Table \ref{tab:main}.
  • Figure 5: Return conditioning sweep on the $4{\times}4$ grid (100 episodes per $G^\star$ value). Varying $G^\star$ from 100 (aggressive corridor) to $-400$ (conservative) produces a smooth, monotonic trade-off: ETT ranges from 72.4 s to 138.2 s while ACD ranges from 16.8 to 5.4 s/veh. The shaded region marks the operational sweet spot ($G^\star \in [-100, 0]$) where both metrics are near-optimal. See Table \ref{tab:conditioning} for exact values.
  • ...and 5 more figures
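
The captions for Figures 2 and 3 describe tokenizing offline episodes into interleaved (return-to-go, state, action) sequences and conditioning on a dispatcher-specified target return $G^\star$ at deployment. The Python sketch below illustrates one common way to do this. It assumes undiscounted returns-to-go and the usual Decision Transformer convention of subtracting each observed reward from the running target; the captions confirm neither detail. The helper names (`returns_to_go`, `interleave`, `next_target`) and the toy episode are hypothetical.

```python
from typing import List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Undiscounted suffix sums: rtg[t] = rewards[t] + rewards[t+1] + ..."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def interleave(
    rtg: Sequence[float], states: Sequence[object], actions: Sequence[int]
) -> List[Tuple[str, object]]:
    """Flatten an episode into (return-to-go, state, action) token triples."""
    tokens: List[Tuple[str, object]] = []
    for g, s, a in zip(rtg, states, actions):
        tokens += [("rtg", g), ("state", s), ("action", a)]
    return tokens


def next_target(target: float, reward: float) -> float:
    """Assumed deployment-time update: subtract the reward just observed."""
    return target - reward


# Hypothetical 3-step episode; rewards, states, and phase indices are made up.
rewards = [-5.0, -3.0, -2.0]
states = ["s0", "s1", "s2"]
actions = [1, 0, 2]

rtg = returns_to_go(rewards)
tokens = interleave(rtg, states, actions)
print(rtg)  # [-10.0, -5.0, -2.0]
```

Under these assumptions, a dispatcher who sets $G^\star = -100$ and observes a first-step reward of $-5$ would feed `next_target(-100.0, -5.0)`, i.e. $-95$, as the next return-to-go token.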