Reward Engineering for Spatial Epidemic Simulations: A Reinforcement Learning Platform for Individual Behavioral Learning

Radman Rakhshandehroo, Daniel Coombs

TL;DR

ContagionRL presents a Gymnasium-compatible platform to study reward engineering in spatial epidemic simulations by integrating a parameterizable SIRS+D model with reinforcement learning for a single learning agent in a population of non-learning humans. The methodology enables systematic evaluation of reward designs (e.g., constant, infection-probability-based, and a dense Potential Field reward) across multiple RL algorithms (PPO, SAC, A2C) and environmental conditions, including partial observability. Key findings show that the Potential Field reward supports superior policy learning through directional guidance and adherence incentives, while simpler rewards can lead to suboptimal or myopic strategies; partial observability can unexpectedly improve robustness and performance. The work contributes a modular, configurable platform for dissecting reward-behavior relationships in spatial epidemics, with implications for designing behaviorally informed interventions and understanding how information structure shapes adaptive responses.

Abstract

We present ContagionRL, a Gymnasium-compatible reinforcement learning platform specifically designed for systematic reward engineering in spatial epidemic simulations. Unlike traditional agent-based models that rely on fixed behavioral rules, our platform enables rigorous evaluation of how reward function design affects learned survival strategies across diverse epidemic scenarios. ContagionRL integrates a spatial SIRS+D epidemiological model with configurable environmental parameters, allowing researchers to stress-test reward functions under varying conditions including limited observability, different movement patterns, and heterogeneous population dynamics. We evaluate five distinct reward designs, ranging from sparse survival bonuses to a novel potential field approach, across multiple RL algorithms (PPO, SAC, A2C). Through systematic ablation studies, we identify that directional guidance and explicit adherence incentives are critical components for robust policy learning. Our comprehensive evaluation across varying infection rates, grid sizes, visibility constraints, and movement patterns reveals that reward function choice dramatically impacts agent behavior and survival outcomes. Agents trained with our potential field reward consistently achieve superior performance, learning maximal adherence to non-pharmaceutical interventions while developing sophisticated spatial avoidance strategies. The platform's modular design enables systematic exploration of reward-behavior relationships, addressing a knowledge gap in models of this type where reward engineering has received limited attention. ContagionRL is an effective platform for studying adaptive behavioral responses in epidemic contexts and highlights the importance of reward design, information structure, and environmental predictability in learning.
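
The abstract names directional guidance and an adherence incentive as the critical components of the potential field reward but does not state its formula. The sketch below shows one way such a reward could be assembled on a toroidal grid. It is an illustration under stated assumptions, not ContagionRL's implementation: the function names, the inverse-distance repulsion, and the weights `w_dir` and `w_adh` are all hypothetical.

```python
import math


def toroidal_delta(a, b, size):
    """Signed shortest displacement from a to b along one wrap-around axis."""
    d = (b - a) % size
    return d - size if d > size / 2 else d


def toroidal_distance(p, q, size):
    """Euclidean distance between p and q on a square toroidal grid."""
    return math.hypot(toroidal_delta(p[0], q[0], size),
                      toroidal_delta(p[1], q[1], size))


def potential_field_reward(agent_pos, move, adherence, infected, size,
                           w_dir=1.0, w_adh=0.5, eps=1e-6):
    """Hypothetical dense reward: directional term plus adherence term.

    Assumed form (not taken from the paper): each infected individual
    repels the agent with magnitude 1/d; the directional term is the
    cosine between the agent's move and the summed repulsion vector.
    """
    fx = fy = 0.0
    for pos in infected:
        # Displacement pointing from the infected individual to the agent.
        dx = toroidal_delta(pos[0], agent_pos[0], size)
        dy = toroidal_delta(pos[1], agent_pos[1], size)
        d = math.hypot(dx, dy)
        if d < eps:
            continue  # co-located: direction undefined, skip
        fx += dx / d ** 2  # unit vector (dx/d) scaled by 1/d
        fy += dy / d ** 2
    norm_f = math.hypot(fx, fy)
    norm_m = math.hypot(move[0], move[1])
    directional = 0.0
    if norm_f > eps and norm_m > eps:
        directional = (fx * move[0] + fy * move[1]) / (norm_f * norm_m)
    return w_dir * directional + w_adh * adherence


# Infected neighbour directly to the left: moving right is rewarded,
# moving left (towards the threat) is penalised.
away = potential_field_reward((5, 5), (1, 0), 1.0, [(4, 5)], size=10)
toward = potential_field_reward((5, 5), (-1, 0), 1.0, [(4, 5)], size=10)
```

Under these assumptions the directional term is bounded in [-1, 1], so the reward stays dense and comparable across population sizes, and the adherence term is rewarded independently of movement.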

Paper Structure

This paper contains 35 sections, 22 equations, 15 figures, 14 tables.

Figures (15)

  • Figure 1: ContagionRL System Architecture. Top: SIRS+D spatial epidemic environment with toroidal grid, configurable observability, and continuous agent control interface. Middle: Reward function design from sparse to potential field-based rewards. Bottom: Multi-dimensional experimental evaluation.
  • Figure 2: Episode duration distributions across different agents, including learning-based (PPO, SAC, A2C) and non-learning baselines (Random, Stationary, Greedy). Each small black dot represents one episode from the full set of 3 seeds × 100 evaluation runs. Per-seed means are shown as large black dots with white outlines. This figure complements the summary statistics in Figure \ref{fig:figure4_barplot} and the statistical comparisons in Table \ref{tab:figure4_mwu}.
  • Figure 3: Comparison of PPO agent performance under five different reward functions. Each model was trained with three random seeds and evaluated over 100 episodes per seed (300 episodes total per reward function). Violin plots show the distribution of episode durations, overlaid with boxplots and per-episode results (black points). Large black dots with white outlines indicate the per-seed mean. One-sided Mann--Whitney U tests (with Bonferroni correction) compare each reward function to the Potential Field baseline. Statistically significant differences ($^{*}p < 0.05$, $^{**}p < 0.01$, $^{***}p < 0.001$) are annotated. See Table \ref{tab:figure2_mwu_table} for exact test values and Figure \ref{fig:figure2_bar} for corresponding means with confidence intervals.
  • Figure 4: Ablation study of the Potential Field reward function. Each variant was evaluated over 100 episodes across 3 training seeds (300 episodes total). Violin plots show the distribution of episode durations, overlaid with boxplots and individual episode results (small black dots). Large black dots with white outlines represent per-seed means. One-sided Mann--Whitney U tests (Bonferroni-corrected) compare each ablation to the full model (Full Potential Field), with significance annotations ($^{*}p < 0.05$, $^{**}p < 0.01$, $^{***}p < 0.001$) shown for statistically significant differences. See Figure \ref{fig:figure5_bar} for aggregated means with confidence intervals, and Table \ref{tab:figure5_mwu_table} for full statistical test results.
  • Figure 5: Impact of visibility radius constraints on RL agent performance in epidemic control. The figure compares four agent types across different observation capabilities: Full Visibility (agent observes all infected individuals), and Limited Visibility with radius constraints r=10, r=15, and r=20 (agent only observes infected individuals within the specified radius). Agent types include: Stationary (no movement), Random (random actions), Greedy (heuristic policy avoiding the nearest infected individual), and Trained (PPO-trained RL agents with respective visibility constraints). Results are averaged across 3 random seeds with 100 evaluation episodes per seed (N=300 per condition). Error bars show 95% bootstrap confidence intervals from per-seed means. Left: Average cumulative reward per episode. Center: Episode length (survival time in timesteps). Right: Infections per timestep, calculated as total infections divided by episode length. Hatching patterns distinguish trained variants: dots (r=10), diagonal lines (r=15), and crosses (r=20). Limited visibility agents (r=15, r=20) achieve higher performance than full visibility, suggesting that observation constraints can improve learning by reducing observation noise and focusing attention on nearby threats.
  • ...and 10 more figures
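
The captions above rely on three reporting conventions: 95% bootstrap confidence intervals computed from per-seed means, Bonferroni correction of p-values, and star annotations at the 0.05, 0.01, and 0.001 thresholds. The following standard-library sketch illustrates those conventions. It assumes a percentile bootstrap, which the captions do not specify, and the function names and example numbers are hypothetical rather than taken from the paper.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values` (e.g. per-seed means)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def stars(p):
    """Map a (corrected) p-value to the annotation used in the captions."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# Made-up per-seed mean episode lengths for one condition (3 seeds).
per_seed_means = [100.0, 120.0, 140.0]
ci_low, ci_high = bootstrap_ci(per_seed_means)

# Made-up raw p-values for three comparisons against a baseline.
corrected = bonferroni([0.01, 0.2, 0.5])
labels = [stars(p) for p in corrected]
```

With only three per-seed means, the resampled means take few distinct values, so intervals built this way are coarse and are best read alongside the per-episode distributions shown in the violin plots.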