Accurately Estimating Unreported Infections using Information Theory

Jiaming Cui, Bijaya Adhikari, Arash Haddadan, A S M Ahsan-Ul Haque, Jilles Vreeken, Anil Vullikanti, B. Aditya Prakash

TL;DR

This work tackles the challenge of estimating unreported infections in epidemics by introducing MdlInfer, an information-theoretic approach that operates on top of traditional ODE-based epidemiological models. Framed through the Minimum Description Length (MDL) principle, it seeks the model $\text{Model}=(D,\Theta',\hat{\Theta})$ that minimizes the total description length $L(D,\Theta',\hat{\Theta})+L(D_{\mathrm{reported}}|D,\Theta',\hat{\Theta})$, thereby jointly estimating the total infections $D$ and a candidate reported rate $\alpha_{\mathrm{reported}}'$. Through a two-step optimization (first estimating $\alpha_{\mathrm{reported}}^*$ and then solving for $D^*$), MdlInfer produces total infection estimates closer to serological benchmarks and improves forecasts of reported infections and of symptomatic-rate trends under both the SAPHIRE and SEIR+HD models. The method also supports counterfactual analysis of non-pharmaceutical interventions (NPIs) and indicates that NPIs targeting asymptomatic and presymptomatic transmission are essential for effective epidemic control. Overall, MdlInfer provides a principled, generalizable framework for epidemic modeling, with potential applicability beyond COVID-19.
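
The two-part objective can be made concrete with a toy calculation. The sketch below is not the paper's encoding: it assumes a simple universal integer code, a fixed 32-bit budget for the parameters, and a hypothetical ODE prediction `predicted_total`. It charges bits for the model part (the candidate total infections $D$ sent as deviations from that prediction) and for the data part (the reported infections sent as deviations from $\alpha_{\mathrm{reported}}' \cdot D$). All names and numbers are made up for illustration.

```python
def signed_int_bits(n: int) -> int:
    """Bits to send a signed integer: Elias-gamma length of |n|+1 plus a sign bit."""
    m = abs(n) + 1                 # shift so that 0 is encodable
    return 2 * m.bit_length()      # (2*floor(log2 m) + 1) + 1 sign bit


def total_description_length(reported, predicted_total, candidate_total,
                             alpha, param_bits=32):
    """Toy two-part cost L(Model) + L(Data | Model), in bits (assumed encoding)."""
    # Model part: fixed-precision parameters, plus the candidate total
    # infections D sent as deviations from the ODE model's prediction.
    model_cost = param_bits + sum(
        signed_int_bits(d - p) for d, p in zip(candidate_total, predicted_total)
    )
    # Data part: reported infections sent as deviations from alpha * D.
    data_cost = sum(
        signed_int_bits(r - round(alpha * d))
        for r, d in zip(reported, candidate_total)
    )
    return model_cost + data_cost


# Made-up numbers, for illustration only.
reported = [10, 20, 40, 80]
predicted_total = [50, 100, 200, 400]   # hypothetical ODE output

# Candidate A: D follows the prediction, reported rate 0.2.
bits_a = total_description_length(reported, predicted_total,
                                  [50, 100, 200, 400], alpha=0.2)
# Candidate B: D equals the reported counts, reported rate 1.0.
bits_b = total_description_length(reported, predicted_total,
                                  [10, 20, 40, 80], alpha=1.0)
print(bits_a, bits_b)
```

In this toy setting candidate A is cheaper to transmit than candidate B, because a $D$ that agrees with both the model's prediction and the scaled reported counts needs few bits in either part.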

Abstract

One of the most significant challenges in combating the spread of infectious diseases is the difficulty of estimating the true magnitude of infections. Unreported infections can drive up disease spread, making it very hard to accurately estimate the infectivity of the pathogen and thereby hampering our ability to react effectively. Despite the use of surveillance-based methods such as serological studies, identifying the true magnitude is still challenging. This paper proposes an information-theoretic approach for accurately estimating the number of total infections. Our approach is built on top of Ordinary Differential Equation (ODE) based models, which are commonly used in epidemiology and for estimating such infections. We show how we can help such models better compute the number of total infections and identify the parametrization by which we need the fewest bits to describe the observed dynamics of reported infections. Our experiments on COVID-19 spread show that our approach leads not only to substantially better estimates of the number of total infections but also to better forecasts of infections than standard model-calibration-based methods. We additionally show how our learned parametrization helps in modeling more accurate what-if scenarios with non-pharmaceutical interventions. Our approach provides a general method for improving epidemic modeling that is broadly applicable.

Paper Structure

This paper contains 59 sections, 32 equations, 11 figures, 2 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of our problem and methodology. (A) We visualize the idea of reported rates using an iceberg. The visible portion above water represents the reported infections, which are only a fraction of the whole iceberg representing total infections. Light green corresponds to the 182 unreported infections estimated by the typical current practice used by researchers, which we call the basic approach, or $\textsc{BaseInfer}$. In contrast, dark green corresponds to the more accurate and much larger 301 unreported infections found by our approach, $\textsc{MdlInfer}$. (B) The usual practice is to calibrate an epidemiological model to reported data and compute the reported rate from the resulting parameterization of the model. Here, an SEIR-style model with explicit compartments for reported versus unreported infections is shown as an example. (C) Our new approach $\textsc{MdlInfer}$ instead aims to compute a more accurate reported rate by finding a 'best' parametrization for the same epidemiological model (i.e., the SEIR-style model in this example) using a principled information-theoretic formulation: a two-part 'sender-receiver' framework. Assume that a hypothetical Sender S wants to transmit the reported infections as the $\textsc{Data}$ to a Receiver R in the cheapest way possible. Hence S will solve for the best $D^*$, intuitively the $\textsc{Model}$ that takes the fewest bits to encode the $\textsc{Data}$. Using $D^*$, we can find the best $\Theta^*$ by exploring a smaller search space.
  • Figure 3: $\textsc{MdlInfer}$ (red) gives a closer estimation of total infections to serological studies (black) than $\textsc{BaseInfer}$ (blue) on various geographical regions and time periods. Note that both approaches try to fit the serological studies without being informed by them. (A)-(H) The red and blue curves represent $\textsc{MdlInfer}$'s estimation of total infections, $\textsc{MdlParam}_{\mathrm{Tinf}}$, and $\textsc{BaseInfer}$'s estimation of total infections, $\textsc{BaseParam}_{\mathrm{Tinf}}$, respectively. The black point estimates and confidence intervals represent the total infections estimated by serological studies [CDCTracker, havers2020seroprevalence], $\textsc{SeroStudy}_{\mathrm{Tinf}}$. (A)-(D) use the $\mathrm{SAPHIRE}$ model and (E)-(H) use the $\mathrm{SEIR+HD}$ model. (I)-(J) The performance metric, $\rho_{\mathrm{Tinf}}$, comparing $\textsc{MdlParam}_{\mathrm{Tinf}}$ against $\textsc{BaseParam}_{\mathrm{Tinf}}$ in fitting serological studies is shown for each region. (I) is for the $\mathrm{SAPHIRE}$ model in (A)-(D), and (J) is for the $\mathrm{SEIR+HD}$ model in (E)-(H). Here, the values of $\rho_{\mathrm{Tinf}}$ are 1.20, 5.47, 7.21, and 1.79 in (I), and 2.62, 1.22, 6.39, and 1.58 in (J). Note that $\rho_{\mathrm{Tinf}}$ larger than 1 means that $\textsc{MdlParam}_{\mathrm{Tinf}}$ is closer to $\textsc{SeroStudy}_{\mathrm{Tinf}}$ than $\textsc{BaseParam}_{\mathrm{Tinf}}$ is. We show more experiments in the Appendix.
  • Figure 4: $\textsc{MdlInfer}$ (red) gives a closer estimation of reported infections (black) than $\textsc{BaseInfer}$ (blue) on various geographical regions and time periods. We use the reported infections in the observed period as inputs and try to forecast the future reported infections (forecast period). (A)-(H) The vertical grey dashed line divides the observed period (left) and forecast period (right). The red and blue curves represent $\textsc{MdlInfer}$'s estimation of reported infections, $\textsc{MdlParam}_{\mathrm{Rinf}}$, and $\textsc{BaseInfer}$'s estimation of reported infections, $\textsc{BaseParam}_{\mathrm{Rinf}}$, respectively. The black plus symbols represent the reported infections collected by the New York Times ($\textsc{NYT-R}\mathrm{inf}$). (A)-(D) use the $\mathrm{SAPHIRE}$ model and (E)-(H) use the $\mathrm{SEIR+HD}$ model. (I)-(J) The performance metric, $\rho_{\mathrm{Rinf}}$, comparing $\textsc{MdlParam}_{\mathrm{Rinf}}$ against $\textsc{BaseParam}_{\mathrm{Rinf}}$ in fitting reported infections is shown for each region. (I) is for the $\mathrm{SAPHIRE}$ model in (A)-(D), and (J) is for the $\mathrm{SEIR+HD}$ model in (E)-(H). Note that $\rho_{\mathrm{Rinf}}$ larger than 1 means that $\textsc{MdlParam}_{\mathrm{Rinf}}$ is closer to $\textsc{NYT-R}\mathrm{inf}$ than $\textsc{BaseParam}_{\mathrm{Rinf}}$ is. We show more experiments in the Appendix.
  • Figure 5: $\textsc{MdlInfer}$ (red) gives a closer estimation of the trends of symptomatic rate (black) than $\textsc{BaseInfer}$ (blue) on various geographical regions and time periods. (A)-(D) The red and blue curves represent $\textsc{MdlInfer}$'s estimation of symptomatic rate, $\textsc{MdlParam}_{\mathrm{Symp}}$, and $\textsc{BaseInfer}$'s estimation of symptomatic rate, $\textsc{BaseParam}_{\mathrm{Symp}}$, respectively. They use the y-scale on the left. The black points and the shaded regions are the point estimates with standard error for $\textsc{Rate}_{\mathrm{Symp}}$ (the COVID-related symptomatic rates derived from the symptomatic surveillance dataset [delphisurvey, salomon2021us]). They use the y-scale on the right. Note that we focus on trends instead of the exact numbers, hence $\textsc{MdlParam}_{\mathrm{Symp}}$/$\textsc{BaseParam}_{\mathrm{Symp}}$ and $\textsc{Rate}_{\mathrm{Symp}}$ may scale differently. We show more experiments in the Appendix.
  • Figure 6: (A) $\textsc{MdlInfer}$ reveals that non-pharmaceutical interventions (NPIs) on asymptomatic and presymptomatic infections are essential to control the COVID-19 epidemic. Here, the red curve and the other five curves represent $\textsc{MdlInfer}$'s estimation of reported infections for the no-NPI scenario and the 5 different NPI scenarios described in the Results section. The vertical grey dashed line divides the observed period (left) and forecast period (right). (B) Inaccurate estimation by $\textsc{BaseInfer}$ may lead to wrong NPI conclusions. The blue curve and the other five curves represent $\textsc{BaseInfer}$'s estimation of reported infections for the no-NPI scenario and the same 5 scenarios as in (A).
  • ...and 6 more figures
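
The captions for Figures 3 and 4 state only that a value of $\rho$ larger than 1 means the $\textsc{MdlInfer}$ estimate is closer to the reference than the $\textsc{BaseInfer}$ estimate. They do not give the formula. One ratio with that property, assuming root-mean-square error as the distance (an assumption, not confirmed by the text above), is sketched here with made-up numbers; the function names `rmse` and `rho` are hypothetical.

```python
import math


def rmse(estimate, reference):
    """Root-mean-square error between two equally long series."""
    squared = [(e - r) ** 2 for e, r in zip(estimate, reference)]
    return math.sqrt(sum(squared) / len(squared))


def rho(base_estimate, mdl_estimate, reference):
    """Assumed metric: baseline error divided by MDL-based error.

    Values above 1 mean the MDL-based estimate is closer to the reference.
    """
    return rmse(base_estimate, reference) / rmse(mdl_estimate, reference)


# Made-up series, for illustration only.
reference = [100, 200, 300]       # e.g. serological point estimates
base_estimate = [80, 160, 240]    # baseline errors: 20, 40, 60
mdl_estimate = [95, 190, 285]     # MDL-based errors: 5, 10, 15

print(rho(base_estimate, mdl_estimate, reference))
```

Because the baseline errors here are exactly four times the MDL-based errors, the ratio comes out at about 4, well above the threshold of 1.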