
Leveraging Synthetic and Genetic Data to Improve Epidemic Forecasting

Dave Osthus, Alexander C. Murph, Emma E. Goldberg, Lauren J. Beesley, William M. Fischer, Nidhi K. Parikh, Lauren A. Castro

Abstract

Forecasting infectious disease outbreaks is hard. Forecasting emerging infectious diseases with limited historical data is even harder. In this paper, we investigate ways to improve emerging infectious disease forecasting under operational constraints. Specifically, we explore two options likely to be available near the start of an emerging disease outbreak: synthetic data and genetic information. For this investigation, we conducted an experiment where we trained deep learning models on different combinations of real and synthetic data, both with and without genetic information, to explore how these models compare when forecasting COVID-19 cases for US states. All models are developed with an eye towards forecasting the next pandemic. We find that models trained with synthetic data have better forecast accuracy than models trained on real data alone, and that models using genetic variant information have better forecast accuracy than those that do not. All models outperformed a baseline persistence model (a feat accomplished by only 7 of 22 real-time COVID-19 case forecasting models, as reported in [38]), and multiple models outperformed the COVIDHub-4_week_ensemble. This paper demonstrates the value of these underutilized sources of information and provides a blueprint for forecasting future pandemics.

Paper Structure

This paper contains 43 sections, 19 equations, 19 figures, and 5 tables.

Figures (19)

  • Figure 1: COVID-19 data for Alabama and California. (a) Weekly total cases (TCs). (b) Proportion of sampled viral genomes assigned to each variant. (c) Variant-attributable cases (VACs), computed as TCs times the proportion of genomes assigned to each variant. VACs summed over all variants equal the TCs. Note the square-root scale on the y-axis for better visibility of low-count VACs.
  • Figure 2: Examples of non-COVID-19, real respiratory data. Over 2,000 time series are available for training, amounting to over 2 million observations.
  • Figure 3: MutAntiGen example runs. MutAntiGen outputs both the total number of cases (top row, TC) and the time series of cases attributed to each variant (bottom row, VAC; each line and color represents a different variant). For each time point, the sum of all variant-attributable cases (bottom row) equals the total cases (top row).
  • Figure 4: 10 of the 20 realizations from the observation model corresponding to a single MutAntiGen output. Realizations were generated by subjecting the "clean" MutAntiGen output either to scaling (random-magnitude compression of the x-axis) with possible addition of outliers (top row), or to scaling plus added noise, again with possible outliers (bottom row).
  • Figure 5: Selected 1-week-ahead through 4-week-ahead forecasts for New Mexico for all models. Black line: total cases time series. Colored points: median forecast cases. Ribbons mark the 50%, 80%, and 95% forecast intervals. Note the square-root scale on the y-axis for better visibility of low-case-count forecasts.
  • ...and 14 more figures
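
The variant-attributable cases (VACs) in Figure 1 are defined as weekly total cases multiplied by the proportion of sampled genomes assigned to each variant, so that VACs summed over variants recover the totals. A minimal sketch of that computation, using made-up numbers (the array values are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical weekly data for one state: total cases (TC) and the
# proportion of sampled viral genomes assigned to each variant per week.
total_cases = np.array([1200.0, 1500.0, 1800.0])   # TC, one value per week
variant_props = np.array([                          # rows: weeks, cols: variants
    [0.7, 0.3],
    [0.5, 0.5],
    [0.2, 0.8],
])

# Variant-attributable cases: VAC = TC * variant proportion, per week.
vac = total_cases[:, None] * variant_props

# By construction, summing VACs over variants recovers the weekly totals.
assert np.allclose(vac.sum(axis=1), total_cases)
print(vac)
```

Note that because the proportions in each week sum to one, the row sums of `vac` equal `total_cases` exactly; this is the identity stated in the Figure 1 caption.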