Scalable calibration of individual-based epidemic models through categorical approximations

Lorenzo Rimella; Nick Whiteley; Chris Jewell; Paul Fearnhead; Michael Whitehouse

TL;DR

This work tackles the intractability of exact likelihoods for partially observed individual-based models (IBMs) in epidemiology by introducing CAL, a deterministic, simulation-free categorical approximation to the likelihood that supports automatic differentiation and scalable inference. CAL replaces the latent population state at time $t-1$ with its expected value to compute a recursive, tractable likelihood, followed by a correction step using the exact emission model; it can be interpreted as the exact likelihood of an approximate model. The authors prove strong consistency of the maximum CAL estimator in the large-population limit and demonstrate the method across SIS/SIR IBMs using gradient-based optimization and Hamiltonian Monte Carlo (HMC) in TensorFlow. Empirically, CAL achieves ground-truth recovery and competitive marginal log-likelihoods at substantially reduced computational cost compared with sequential Monte Carlo (SMC) variants, and it scales to the 162,775 farms of the 2001 UK Foot-and-Mouth outbreak, highlighting its practical value for real-time, large-scale epidemic calibration. The work also discusses limitations (e.g., independence assumptions across individuals) and outlines future avenues, such as household-structured extensions and time-varying covariates, to broaden applicability and robustness.
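
To make the recursion described above concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a homogeneous-mixing SIS model with a hypothetical reporting scheme: each individual is either unreported (0), reported susceptible (1), or reported infected (2). The function name `cal_log_likelihood` and the parameters `beta`, `gamma`, `q_s`, and `q_i` are illustrative choices that do not come from the paper, which covers more general models with individual-specific rates and spatial interactions.

```python
import math


def cal_log_likelihood(y, pi0, beta, gamma, q_s, q_i):
    """Illustrative CAL-style recursion for an SIS model (assumed setup).

    y    : list over time steps of lists over individuals, with codes
           0 = unreported, 1 = reported susceptible, 2 = reported infected
    pi0  : per-individual initial probabilities [P(S), P(I)]
    beta, gamma : assumed infection and recovery rates
    q_s, q_i    : assumed reporting probabilities for S and I individuals
    """
    n_ind = len(pi0)
    # Emission probabilities emission[state][observation] (hypothetical model).
    emission = [[1.0 - q_s, q_s, 0.0],
                [1.0 - q_i, 0.0, q_i]]
    pi = [list(p) for p in pi0]
    log_lik = 0.0
    for y_t in y:
        # Replace the unknown population state at the previous time step
        # by its expected value under the current categorical distributions.
        expected_infected = sum(p[1] for p in pi) / n_ind
        p_inf = 1.0 - math.exp(-beta * expected_infected)
        p_rec = 1.0 - math.exp(-gamma)
        new_pi = []
        for p, obs in zip(pi, y_t):
            # Prediction: push each individual's distribution through
            # the transition probabilities.
            pred_s = p[0] * (1.0 - p_inf) + p[1] * p_rec
            pred_i = p[0] * p_inf + p[1] * (1.0 - p_rec)
            # Correction: weight by the exact emission probabilities.
            joint_s = pred_s * emission[0][obs]
            joint_i = pred_i * emission[1][obs]
            marginal = joint_s + joint_i
            if marginal <= 0.0:
                # The observation is impossible under the approximation.
                return float("-inf"), pi
            log_lik += math.log(marginal)
            new_pi.append([joint_s / marginal, joint_i / marginal])
        pi = new_pi
    return log_lik, pi


# Usage on a toy data set: three individuals observed over two time steps.
y = [[1, 0, 2], [0, 2, 2]]
pi0 = [[0.9, 0.1] for _ in range(3)]
ll, filtered = cal_log_likelihood(y, pi0, beta=0.5, gamma=0.2, q_s=0.3, q_i=0.6)
print(ll, filtered)
```

Because every step is a deterministic arithmetic operation on probability vectors, the same recursion written in an automatic-differentiation framework would yield gradients with respect to the parameters, which is the property the summary above highlights.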

Abstract

Traditional compartmental models capture population-level dynamics but fail to characterize individual-level risk. The computational cost of exact likelihood evaluation for partially observed individual-based models, however, grows exponentially with the population size, necessitating approximate inference. Existing sampling-based methods usually require multiple simulations of the individuals in the population and rely on bespoke proposal distributions or summary statistics. We propose a deterministic approach to approximating the likelihood using categorical distributions. The approximate likelihood is amenable to automatic differentiation so that parameters can be estimated by maximization or posterior sampling using standard software libraries such as Stan or TensorFlow with little user effort. We prove the consistency of the maximum approximate likelihood estimator. We empirically test our approach on several classes of individual-based models for epidemiology: different sets of disease states, individual-specific transition rates, spatial interactions, under-reporting and misreporting. We demonstrate ground truth recovery and comparable marginal log-likelihood values at substantially reduced cost compared to competitor methods. Finally, we show the scalability and effectiveness of our approach with a real-world application on the 2001 UK Foot-and-Mouth outbreak, where the simplicity of the CAL allows us to include 162775 farms.

Paper Structure (84 sections, 33 theorems, 356 equations, 13 figures, 9 tables, 4 algorithms)

Introduction
Motivating example
Heterogeneous attributes.
Homogeneous- and heterogeneous-mixing dynamics.
Observations.
Inference challenges.
Related work
Individual-based compartmental model
Notation
Model
Latent dynamics
Observations
Motivating example
Homogeneous- and heterogeneous-mixing dynamics.
Observation model.
...and 69 more sections

Key Result

Theorem 1

Let Assumptions ass:main_compactness_continuity, ass:main_w_iid, ass:main_HMM_support, ass:main_eta_structure, and ass:main_kernel_continuity hold, and let $\hat{\theta}_N$ be a maximizer of $\ell_{1:T}^N(\theta)$. Then $\hat{\theta}_N$ converges to $\Theta^\star$ as $N \to \infty$, $\mathbb{P}$-almost surely.

Figures (13)

Figure 1: Trace plots for HMC under different population sizes. Solid red lines denote the DGP.
Figure 2: A realization of the latent process from Model 1 (first row) and Model 2 (second row) when $N=1000$. Different columns are associated with different time steps. For Model 1, blue and red dots refer to susceptible and infected individuals, respectively. For Model 2, the communities are blue circles with a radius that is proportional to their population, while the red circles are proportional to the number of infected inside the communities.
Figure 3: CAL filtering for $t=5,10,20,50$ under the well-specified and misspecified scenario. Rows from top to bottom: observed data, true latent disease states, and inferred latent disease states from CAL filtering under the well-specified and misspecified models. The yellow dots are used for unreported individuals, while blue, red, black are susceptible, infected, removed.
Figure 4: The farms and the local authorities included in the study. On the left, dots represent farms, with red indicating that the farm was reported infected at some point in time. On the right, the local authorities are blue circles with a radius proportional to the number of farms. Red inner circles are proportional to the number of farms within the local authority that were reported infected during the outbreak. Black contours represent the geometries of the local authorities.
Figure 5: On the left and in the center are the heat maps of the inferred susceptibility and infectivity. On the right is the inferred spatial kernel effect on both infection and culling as a function of the distance in kilometers. The solid lines represent posterior means and the shaded bands represent $95\%$ credible intervals.
...and 8 more figures

Theorems & Definitions (67)

Theorem 1
Proposition 2
proof
Proposition 3
proof
Proposition 4
proof
Lemma 5
proof
Proposition 6
...and 57 more
