Investigating Forecasting Models for Pandemic Infections Using Heterogeneous Data Sources: A 2-year Study with COVID-19
Zacharias Komodromos, Kleanthis Malialis, Panayiotis Kolios
TL;DR
The paper addresses near-term COVID-19 infection forecasting in a data-rich, multi-source setting. It applies two forecasting approaches, XGBoost and ARIMAX, to a two-year Cyprus case study that integrates epidemiological, vaccination, policy, and weather data. Infection-related features are central to predictive performance; external signals such as policy and weather provide additional gains, while vaccination signals have limited near-term predictive power. The effect of the forecast horizon differs by regime, with XGBoost performing better during waves and ARIMAX during non-wave periods. The work advances pandemic preparedness by demonstrating how heterogeneous data fusion and careful feature selection can improve forecast accuracy in a real-world setting, and it offers generalisable insights for similar regions.
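The two ideas summarised above, fusing several daily data sources into lagged features and scoring forecasts separately for wave and non-wave periods, can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `make_supervised` and `mae_by_regime`, the series names, the lag count, the horizon, and the choice of mean absolute error are all assumptions made here for illustration, and the numbers are toy values rather than data from the study.

```python
from typing import Dict, List, Tuple


def make_supervised(
    sources: Dict[str, List[float]],
    target: str,
    lags: int,
    horizon: int,
) -> Tuple[List[List[float]], List[float], List[str]]:
    """Build a lagged design matrix from aligned daily series.

    The row for day t holds each source's values at t, t-1, ..., t-lags+1;
    its label is the target series at day t + horizon.
    """
    n = len(sources[target])
    if any(len(series) != n for series in sources.values()):
        raise ValueError("all series must cover the same days")
    names = [f"{name}_lag{k}" for name in sources for k in range(lags)]
    X, y = [], []
    # Start once enough history exists; stop while the label is still observed.
    for t in range(lags - 1, n - horizon):
        X.append([sources[name][t - k] for name in sources for k in range(lags)])
        y.append(sources[target][t + horizon])
    return X, y, names


def mae_by_regime(
    y_true: List[float], y_pred: List[float], in_wave: List[bool]
) -> Dict[str, float]:
    """Mean absolute error computed separately for wave and non-wave days."""
    result = {}
    for flag, label in ((True, "wave"), (False, "non_wave")):
        errors = [abs(a - b) for a, b, w in zip(y_true, y_pred, in_wave) if w == flag]
        result[label] = sum(errors) / len(errors) if errors else float("nan")
    return result


# Toy example: two hypothetical sources, two lags, one-day-ahead target.
toy_sources = {
    "cases": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "temperature": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
}
X, y, names = make_supervised(toy_sources, target="cases", lags=2, horizon=1)
scores = mae_by_regime(
    y_true=[10.0, 20.0, 30.0, 40.0],
    y_pred=[12.0, 18.0, 33.0, 40.0],
    in_wave=[True, True, False, False],
)
```

The resulting `X` and `y` could be passed to any regressor, and the same per-regime scoring could be applied to each model's forecasts to compare them within waves and outside them.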
Abstract
Emerging in December 2019, the COVID-19 pandemic caused widespread health, economic, and social disruption. Rapid global transmission overwhelmed healthcare systems, resulting in high infection rates, hospitalisations, and fatalities. To limit the spread, governments implemented non-pharmaceutical interventions such as lockdowns and travel restrictions. While effective in controlling transmission, these measures also posed significant economic and societal challenges. Although the WHO declared in May 2023 that COVID-19 was no longer a global health emergency, its impact persists and continues to shape public health strategies. The vast amount of data collected during the pandemic offers valuable insights into disease dynamics, transmission, and intervention effectiveness. Leveraging these insights can improve forecasting models, enhancing preparedness and response to future outbreaks while mitigating their social and economic impact. This paper presents a large-scale case study on COVID-19 forecasting in Cyprus, utilising a two-year dataset that integrates epidemiological data, vaccination records, policy measures, and weather conditions. We analyse infection trends, assess forecasting performance, and examine the influence of external factors on disease dynamics. The insights gained contribute to improved pandemic preparedness and response strategies.
