What Makes LLM Agent Simulations Useful for Policy Practice? An Iterative Design Study in Emergency Preparedness

Yuxuan Li, Sauvik Das, Hirokazu Shirado

TL;DR

The paper examines how LLM agent simulations can be meaningfully integrated into policy practice under deep uncertainty, drawing on a year-long, stakeholder-engaged design study in emergency preparedness. It adopts an iterative co-design approach with a university emergency preparedness team, culminating in a stadium-scale simulation of ~13,000 agents that informs training, evacuation procedures, and infrastructure planning rather than predicting exact outcomes. The authors identify five design mechanisms (validation, trust bootstrapping, surfacing tacit knowledge via fix-it responses, attention to contextual details, and policy–AI co-evolution) that explain how usefulness emerges in real-world policy contexts. The work argues for moving beyond raw model fidelity toward institutional alignment, showing how simulations can serve as technology probes that support practical sensemaking, planning, and iterative policy refinement with stakeholders.

Abstract

Policymakers must often act under conditions of deep uncertainty, such as emergency response, where predicting the specific impacts of a policy a priori is implausible. Large Language Model (LLM) agent simulations have been proposed as tools to support policymakers under these conditions, yet little is known about how such simulations become useful for real-world policy practice. To address this gap, we conducted a year-long, stakeholder-engaged design process with a university emergency preparedness team. Through iterative design cycles, we developed and refined an LLM agent simulation of a large-scale campus gathering, ultimately scaling to 13,000 agents that modeled crowd movement and communication under various emergency scenarios. Rather than producing predictive forecasts, these simulations supported policy practice by shaping volunteer training, evacuation procedures, and infrastructure planning. Analyzing these findings, we identify three design process implications for making LLM agent simulations that are useful for policy practice: start from verifiable scenarios to bootstrap trust, use preliminary simulations to elicit tacit domain knowledge, and treat simulation capabilities and policy implementation as co-evolving.

Paper Structure

This paper contains 46 sections, 9 figures, 7 algorithms.

Figures (9)

  • Figure 1: Simulation overview. (A) Conceptual diagram of the LLM agents. Agents receive inputs from three sources: Personal (e.g., accessibility needs, group membership, persona), Social (group chat, official instructions, social network), and Environmental (stadium overview, local layout, neighbor mobility). For each agent, an LLM governs decision-making and communication and hands off to a rule-based controller for embodied action. Outputs include Physical (moving/navigating, exiting), Decision (destination choice, regrouping), and Communication (ongoing group chat). (B) System display: stadium-scale interface showing physical layouts, agents as colored dots, coordinators, and a per-agent side panel. Numbered callouts: (1) stage area; (2) students with accessibility needs; (3) seating areas for a portion of students' family and friends; (4) seating sections for students from different majors, each major marked by a distinct color; (5) exits; (6) bleacher area with dense family-and-friends seating; (7) coordinators distributed around the field track for directing flow; (8) a per-agent side panel showing persona attributes (name, major, and profile), with the group-chat messages exchanged between agents that have social relationships and belong to the same group shown below. Together, this system supports simulations with tens of thousands of agents (up to 13k in our largest runs).
  • Figure 2: Iterative design process. Across 16 months (May 2024–Aug 2025), we progressed from preparation interviews through five simulation iterations and an in-situ observation. Each phase introduced greater scale and realism (from 100 to 13,000 agents, adding roles, bottlenecks, and family dynamics) and was shaped by policymakers’ feedback. The core system used in later studies (Fig. 1) was developed by the fourth iteration. Early iterations surfaced distrust and missing realism, later iterations built credibility and training value, and by the final iteration, simulations informed adopted protocols, feasibility assessments, and official after-action reports.
  • Figure 3: Comparison of crowd movement dynamics between empirical observation and simulation. (A) Photograph of the 2025 commencement, showing attendee movement as the ceremony concluded under routine (non-emergency) conditions. (B) Snapshot from the corresponding simulation, incorporating 13,000 agents with roles for students, family members, and coordinators, represented as colored dots. The simulation (Fig. 1) reproduced key spatial patterns observed in the event, including congestion in the track areas at the bottom of the frame, supporting alignment between observed and simulated crowd dynamics.
  • Figure 4: Cumulative evacuation progress across policy alternatives in a severe weather scenario. The figure shows the simulated proportion of agents evacuated over time after an official severe weather announcement instructing evacuation from the stadium. Each curve represents the cumulative percentage of agents who have exited under the current procedure and three alternative strategies explored with the developed stadium system (Fig. 1). Vertical dashed lines indicate the time required to evacuate 80% of agents, a threshold that policymakers routinely use to assess evacuation effectiveness. Opening the northwest exit (green) significantly reduces evacuation time relative to the current process (yellow), corresponding to a 23.0% reduction at the 80% threshold, while repositioning coordinators or emphasizing the south exit yields only marginal gains. These comparative trajectories supported policymakers’ reasoning about which procedural changes were likely to meaningfully improve evacuation efficiency.
  • Figure A1: Developers and policymaker P1 co-creating a stakeholder map and process map during a policy meeting. The session supported knowledge elicitation of roles, responsibilities, and workflows in emergency preparedness.
  • ...and 4 more figures
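
The comparison described in the Figure 4 caption (time to evacuate 80% of agents, and the relative reduction between two procedures) can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names `time_to_threshold` and `percent_reduction`, the linear interpolation between samples, and the toy curves are all assumptions introduced here, and the numbers are not taken from the paper.

```python
from typing import Sequence


def time_to_threshold(
    times: Sequence[float],
    evacuated_frac: Sequence[float],
    threshold: float = 0.8,
) -> float:
    """First time at which the cumulative evacuated fraction reaches `threshold`.

    Linearly interpolates between adjacent samples (an assumption of this sketch).
    """
    if len(times) != len(evacuated_frac) or len(times) == 0:
        raise ValueError("times and evacuated_frac must be non-empty and equal length")
    if evacuated_frac[0] >= threshold:
        return float(times[0])
    for i in range(1, len(times)):
        prev, cur = evacuated_frac[i - 1], evacuated_frac[i]
        if cur >= threshold:
            # prev < threshold <= cur here, so the denominator is positive.
            share = (threshold - prev) / (cur - prev)
            return times[i - 1] + share * (times[i] - times[i - 1])
    raise ValueError("curve never reaches the threshold")


def percent_reduction(baseline_time: float, alternative_time: float) -> float:
    """Relative reduction in evacuation time versus the baseline, in percent."""
    if baseline_time <= 0:
        raise ValueError("baseline_time must be positive")
    return 100.0 * (baseline_time - alternative_time) / baseline_time


# Hypothetical toy curves (minutes, cumulative fraction evacuated); NOT paper data.
minutes = [0, 5, 10, 15, 20]
current_procedure = [0.0, 0.2, 0.5, 0.7, 0.9]
extra_exit_open = [0.0, 0.3, 0.7, 0.9, 1.0]

t_current = time_to_threshold(minutes, current_procedure)
t_alternative = time_to_threshold(minutes, extra_exit_open)
print(f"current: {t_current:.1f} min, alternative: {t_alternative:.1f} min")
print(f"reduction at 80% threshold: {percent_reduction(t_current, t_alternative):.1f}%")
```

Comparing policies at a fixed threshold rather than at full evacuation keeps the metric from being dominated by a few slow agents, which is consistent with the caption's note that policymakers routinely use the 80% mark.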