PolicySimEval: A Benchmark for Evaluating Policy Outcomes through Agent-Based Simulation
Jiaju Kang, Puyu Han, Tian Zhang, Luqi Gong
TL;DR
PolicySimEval is a dedicated benchmark for evaluating agent-based models (ABMs) in policy analysis, organized into end-to-end scenarios, targeted sub-tasks, and auto-generated tasks. It provides a multi-dimensional evaluation framework and gold-standard solutions to assess ABMs’ accuracy, calibration, interpretability, and ethics in policy contexts. The benchmark includes 20 comprehensive scenarios, 65 targeted sub-tasks, and 200 auto-generated tasks, with metrics spanning $C_t$, $D_c$, $R_{cover}$, $E_b$, $H_c$, $T_{sim}$, $Q_l$, $R_e$, $C_e$, $A_r$, $T_a$, $T_r$, and $S_v$ to capture both outcomes and process quality. Experiments show that state-of-the-art systems struggle on all three task types: the best system reaches coverage rates of only $24.5\%$, $15.04\%$, and $14.5\%$ on comprehensive scenarios, sub-tasks, and auto-generated tasks, respectively, underscoring the need for methodological advances to close the gap to real-world policy evaluation.
Abstract
With the growing adoption of agent-based models in policy evaluation, a pressing question arises: Can such systems effectively simulate and analyze complex social scenarios to inform policy decisions? Addressing this challenge could significantly enhance the policy-making process, offering researchers and practitioners a systematic way to validate, explore, and refine policy outcomes. To advance this goal, we introduce PolicySimEval, the first benchmark designed to evaluate the capability of agent-based simulations in policy assessment tasks. PolicySimEval aims to reflect the real-world complexities faced by social scientists and policymakers. The benchmark is composed of three categories of evaluation tasks: (1) 20 comprehensive scenarios that replicate end-to-end policy modeling challenges, complete with annotated expert solutions; (2) 65 targeted sub-tasks that address specific aspects of agent-based simulation (e.g., agent behavior calibration); and (3) 200 auto-generated tasks to enable large-scale evaluation and method development. Experiments show that current state-of-the-art frameworks struggle to tackle these tasks effectively, with the highest-performing system achieving a coverage rate of only 24.5\% on comprehensive scenarios, 15.04\% on sub-tasks, and 14.5\% on auto-generated tasks. These results highlight the difficulty of the tasks and the gap between current capabilities and the requirements for real-world policy evaluation.
