Not All Accuracy Is Equal: Prioritizing Independence in Infectious Disease Forecasting

Carson Dudley, Marisa Eisenberg

TL;DR

A toy example illustrating the theoretical cost of correlated errors is presented, correlations among COVID-19 forecasting models are analyzed, and improvements to model fitting and ensemble construction that foster genuine diversity are proposed.

Abstract

Ensemble forecasts have become a cornerstone of large-scale disease response, underpinning decision making at agencies such as the US Centers for Disease Control and Prevention (CDC). Their growing use reflects the goal of combining multiple models to improve accuracy and stability versus relying on any single model. However, while ensembles regularly demonstrate stability against individual model failures, improved accuracy is not guaranteed. During the COVID-19 pandemic, the CDC's multi-model ensemble outperformed the best single model by only 1%, and CDC flu ensembles have often ranked below individual models. Prior work has established that ensemble performance depends critically on diversity: when models make independent errors, combining them yields substantial gains. In practice, however, this diversity is often lacking. Here, we propose that this is due in part to how models are developed and selected: both modelers and ensemble builders optimize for stand-alone accuracy rather than ensemble contribution, and most epidemic forecasts are built from a small set of approaches trained on the same surveillance data. The result is highly correlated errors, limiting the benefit of ensembling. This suggests that in developing models and ensembles, we should prioritize models that contribute complementary information rather than replicating existing approaches. We present a toy example illustrating the theoretical cost of correlated errors, analyze correlations among COVID-19 forecasting models, and propose improvements to model fitting and ensemble construction that foster genuine diversity. Ensembles built with this principle in mind produce forecasts that are more robust and more valuable for epidemic preparedness and response.

Paper Structure

This paper contains 7 sections, 3 equations, 2 figures.

Figures (2)

  • Figure 1: Residual correlation structure among forecasting models. Heatmap of Pearson correlations between model residuals for selected case forecasting models in the CDC COVID-19 Forecast Hub from July 2020 to December 2022. Models were restricted to those with at least one year of overlapping forecasts with at least one other model for weekly cases. Models are clustered into three groups using agglomerative clustering. Cell annotations report correlation values. Extremely high within-cluster correlations (often $>0.95$), contrasted with much weaker or negative between-cluster correlations, indicate that the ensemble was composed of only a few distinct families of models, which limits the gains available from ensembling. White squares indicate that two models never had a full year of common overlap. Apparent discrepancies (e.g., IHME correlating positively with USACE but negatively with CU models, even though USACE and CU are strongly correlated) arise because pairs were evaluated on different periods of overlap.
  • Figure 2: Impact of error correlation on ensemble performance. The blue line shows the expected ensemble skill for $N=28$ unbiased models of equal quality, each only 5% better than the baseline (versus 13% median skill from the COVID-19 ensemble [evaluationpnas]). As correlation increases, the benefit of ensembling declines sharply. The dashed line marks the observed performance of the CDC COVID-19 ensemble ($\approx 0.66$) [evaluationpnas], far above the theoretical potential ($\approx 0.03$ if errors were uncorrelated). This gap reflects the high correlation of errors among current models, limiting ensemble gains.
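
The toy setting in Figure 2 can be roughly reproduced with the standard variance formula for the mean of equicorrelated errors. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes performance is measured as mean squared error relative to a baseline (lower is better), that all $N$ models are unbiased with the same error variance, and that every pair of models shares one correlation $\rho$. Under those assumptions the equal-weight ensemble has relative error $r\,[1/N + (1 - 1/N)\rho]$, where $r$ is the single-model relative error. The function names and the Monte Carlo check are hypothetical helpers introduced here.

```python
import numpy as np


def ensemble_relative_error(rel_error: float, n_models: int, rho: float) -> float:
    """Expected relative squared error of an equal-weight mean of n_models
    unbiased models with equal error variance and pairwise correlation rho.

    Assumption: rel_error is each model's MSE divided by the baseline MSE.
    """
    return rel_error * (1.0 / n_models + (1.0 - 1.0 / n_models) * rho)


def implied_correlation(observed: float, rel_error: float, n_models: int) -> float:
    """Invert the formula above: the rho that would yield `observed`
    under this toy model (not an estimate for any real ensemble)."""
    frac = observed / rel_error
    return (frac - 1.0 / n_models) / (1.0 - 1.0 / n_models)


def simulate_ensemble_error(rel_error: float, n_models: int, rho: float,
                            n_draws: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo check: build equicorrelated Gaussian errors from one
    shared factor plus independent noise, then average across models."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_draws, 1))
    own = rng.standard_normal((n_draws, n_models))
    errors = np.sqrt(rel_error) * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own)
    return float(np.mean(errors.mean(axis=1) ** 2))


if __name__ == "__main__":
    n, r = 28, 0.95  # 28 models, each 5% better than the baseline
    for rho in (0.0, 0.25, 0.5, 0.75, 1.0):
        analytic = ensemble_relative_error(r, n, rho)
        simulated = simulate_ensemble_error(r, n, rho)
        print(f"rho={rho:.2f}  analytic={analytic:.3f}  simulated={simulated:.3f}")
    print("rho implied by 0.66:", round(implied_correlation(0.66, r, n), 3))
```

With $r = 0.95$ and $N = 28$, the uncorrelated case gives $0.95/28 \approx 0.034$, which is consistent with the $\approx 0.03$ quoted in the caption. Inverting the same formula for the observed value of $\approx 0.66$ gives $\rho \approx 0.68$, but that figure only describes this simplified model. Real forecasts differ in quality and are scored with other metrics, so it should not be read as a measured correlation.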