Pandemics In Silico: Scaling an Agent-Based Simulation on Realistic Social Contact Networks

Joy Kitson, Ian Costello, Jiangzhuo Chen, Diego Jiménez, Stefan Hoops, Henning Mortveit, Esteban Meneses, Jae-Seung Yeom, Madhav V. Marathe, Abhinav Bhatele

TL;DR

Loimos addresses the need for fast, scalable agent-based epidemic simulations on realistic social contact networks. It introduces a hybrid discrete-event/time-stepping framework built on Charm++ to model contagion across a population–location graph, with modular disease, intervention, and performance-optimization capabilities. The study validates Loimos against EpiHiper and demonstrates strong and weak scaling on the Perlmutter supercomputer, achieving rapid, large-scale simulations (e.g., 200 days in ~42 seconds on 4096 cores) and identifying optimizations that substantially reduce runtime. The work has practical impact for policy analysis by enabling rapid exploration of intervention scenarios at national to regional scales on HPC resources.

Abstract

Preventing the spread of infectious diseases requires implementing interventions at various levels of government and evaluating the potential impact and efficacy of those preemptive measures. Agent-based modeling can be used for detailed studies of epidemic diffusion and possible interventions. Modeling of epidemic diffusion in large social contact networks requires the use of parallel algorithms and resources. In this work, we present Loimos, a scalable parallel framework for simulating epidemic diffusion. Loimos uses a hybrid of time-stepping and discrete-event simulation to model disease spread, and is implemented on top of an asynchronous, many-task runtime. We demonstrate that Loimos is able to achieve significant speedups while scaling to large core counts. In particular, Loimos is able to simulate 200 days of a COVID-19 outbreak on a digital twin of California in about 42 seconds, for an average of 4.6 billion traversed edges per second (TEPS), using 4096 cores on Perlmutter at NERSC.
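
The headline figures in the abstract can be combined into a few derived quantities. The short calculation below is a back-of-the-envelope sketch, not a result from the paper: it assumes the 4.6 billion TEPS average is taken over the whole 42-second run, and the variable names are illustrative only.

```python
# Figures reported in the abstract.
SIMULATED_DAYS = 200
WALL_CLOCK_SECONDS = 42      # "about 42 seconds"
AVERAGE_TEPS = 4.6e9         # traversed edges per second
CORES = 4096

# Assumption: the TEPS average covers the full wall-clock run.
total_edges = AVERAGE_TEPS * WALL_CLOCK_SECONDS

# Derived per-day and per-core rates.
edges_per_simulated_day = total_edges / SIMULATED_DAYS
seconds_per_simulated_day = WALL_CLOCK_SECONDS / SIMULATED_DAYS
teps_per_core = AVERAGE_TEPS / CORES

print(f"total traversed edges      ~ {total_edges:.3e}")
print(f"edges per simulated day    ~ {edges_per_simulated_day:.3e}")
print(f"seconds per simulated day  ~ {seconds_per_simulated_day:.2f}")
print(f"TEPS per core              ~ {teps_per_core:.3e}")
```

Under that assumption, the run traverses roughly 1.9 × 10^11 edges in total, spends about 0.21 seconds of wall-clock time per simulated day, and averages on the order of 1.1 million traversed edges per second per core.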

Paper Structure (21 sections, 3 equations, 9 figures, 3 tables, 2 algorithms)

Figures (9)

  • Figure 1: Visualizations generated using the Projections tool showing processor utilization in three iterations of Loimos on two Perlmutter nodes (256 cores) with MI data. Note that there is much more idle time on a larger number of cores with no optimizations (left) than with static load balancing only (right).
  • Figure 2: Strong scaling comparison of the performance of different symmetric multiprocessing (SMP) configurations on Perlmutter with MI data. All four SMP configurations -- with different processes per node (p/n) and worker threads per process (t/p) counts -- perform worse than the non-SMP configuration for all core counts. Execution times are averaged over three runs, with extrema shown in error bars.
  • Figure 3: Visualizations generated using the Projections tool showing the breakdown of time spent in three iterations of Loimos on two Perlmutter nodes (256 cores) with MI data. We observe that most of the time is spent in the person state communication (PSC) phase with only static load balancing and short circuit evaluation of interactions (left), but negligible time doing so when storing visits on location chares (right). In the latter case, the time spent in the exposure computation and communication phase (ECC) dominates, and the total execution time is significantly reduced.
  • Figure 4: A 200-day simulation of Loimos on two Perlmutter nodes (256 cores) with MI data. We observe that time spent on exposure computation and communication is much greater without (static $t_\text{ECC}$) than with (static+sc $t_\text{ECC}$) short circuit evaluation of interactions. In the latter case, execution time is highest when cases (infections) are increasing fastest.
  • Figure 5: Performance impact of adding the following optimizations over the original Loimos implementation (no-opts): (1) static load balancing (static), (2) short circuit evaluation of interactions (sc), and (3) storing visit data on location chares (loc-visits). Each added optimization reduces runtimes, which are averaged over three runs on the MI data on Perlmutter with extrema shown in error bars.
  • ...and 4 more figures