Machine learning in an expectation-maximisation framework for nowcasting
Paul Wilsens, Katrien Antonio, Gerda Claeskens
TL;DR
The paper addresses nowcasting under reporting delays by extending the EM framework to incorporate machine learning in the maximisation step, enabling flexible modelling of both occurrence intensities and reporting probabilities without fixed functional forms. It demonstrates that neural networks and especially XGBoost can capture nonlinear covariate effects and interdependencies, outperforming traditional GLMs in nonlinear settings. Simulation studies show that additive XGBoost provides the most accurate parameter estimates, while weight initialisation and early stopping stabilise training within the ML-based EM framework. Application to Argentinian Covid-19 data confirms the method’s practical value, with XGBoost delivering the best out-of-sample nowcasts and delay-aware insights across demographic groups and regions.
Abstract
Decision making often occurs in the presence of incomplete information, leading to the under- or overestimation of risk. Leveraging the observable information to learn the complete information is called nowcasting. In practice, incomplete information is often a consequence of reporting or observation delays. In this paper, we propose an expectation-maximisation (EM) framework for nowcasting that uses machine learning techniques to model both the occurrence and the reporting process of events. We allow for the inclusion of covariate information specific to the occurrence and reporting periods as well as characteristics related to the entity for which events occurred. We demonstrate how the maximisation step and the information flow between EM iterations can be tailored to leverage the predictive power of neural networks and (extreme) gradient boosting machines (XGBoost). With simulation experiments, we show that we can effectively model both the occurrence and reporting of events when dealing with high-dimensional covariate information. In the presence of non-linear effects, we show that our methodology outperforms existing EM-based nowcasting frameworks that use generalised linear models in the maximisation step. Finally, we apply the framework to the reporting of Argentinian Covid-19 cases, where the XGBoost-based approach again performs best.
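To make the EM idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a covariate-free setting in which the events of occurrence period t are Poisson with mean lam[t] and are reported with delay d with probability p[d], so the maximisation step has a closed form. The function name em_nowcast, the array layout and the starting values are illustrative assumptions. In the framework described above, this closed-form update is the place where a neural network or XGBoost fit on covariates would go.

```python
import numpy as np


def em_nowcast(counts, mask, n_iter=200):
    """Toy EM for event counts subject to reporting delay (illustrative only).

    counts[t, d]: events that occurred in period t and were reported d periods later.
    mask[t, d]:   True where that cell has already been observed.
    Assumes counts[t, d] ~ Poisson(lam[t] * p[d]) with sum(p) == 1 and no covariates.
    """
    n_periods, n_delays = counts.shape
    observed = np.where(mask, counts, 0.0)

    # Arbitrary positive starting values (an assumption of this sketch).
    lam = observed.sum(axis=1) + 1.0
    p = np.full(n_delays, 1.0 / n_delays)

    for _ in range(n_iter):
        # E-step: replace each unobserved cell by its expected count
        # under the current parameter values.
        expected = np.outer(lam, p)
        completed = np.where(mask, observed, expected)

        # M-step: closed-form Poisson maximum likelihood on the completed data.
        lam = completed.sum(axis=1)
        p = completed.sum(axis=0) / completed.sum()

    return lam, p
```

Given the fitted values, a nowcast of the events that occurred in period t but are not yet reported would be lam[t] multiplied by the sum of p[d] over the delays d that are still unobserved for that period.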
