Machine learning in an expectation-maximisation framework for nowcasting

Paul Wilsens, Katrien Antonio, Gerda Claeskens

TL;DR

The paper addresses nowcasting under reporting delays by extending the EM framework to incorporate machine learning in the maximisation step, enabling flexible modelling of both occurrence intensities and reporting probabilities without fixed functional forms. It demonstrates that neural networks and especially XGBoost can capture non-linear covariate effects and interdependencies, outperforming traditional GLMs in non-linear settings. Simulation studies show that additive XGBoost provides the most accurate parameter estimates, and that weight initialisation and early stopping stabilise training within the ML-based EM framework. Application to Argentinian Covid-19 data confirms the method’s practical value, with XGBoost delivering the best out-of-sample nowcasts and delay-aware insights across demography and regions.

Abstract

Decision making often occurs in the presence of incomplete information, leading to the under- or overestimation of risk. Leveraging the observable information to learn the complete information is called nowcasting. In practice, incomplete information is often a consequence of reporting or observation delays. In this paper, we propose an expectation-maximisation (EM) framework for nowcasting that uses machine learning techniques to model both the occurrence and the reporting process of events. We allow for the inclusion of covariate information specific to the occurrence and reporting periods as well as characteristics related to the entity for which events occurred. We demonstrate how the maximisation step and the information flow between EM iterations can be tailored to leverage the predictive power of neural networks and (extreme) gradient boosting machines (XGBoost). With simulation experiments, we show that we can effectively model both the occurrence and reporting of events when dealing with high-dimensional covariate information. In the presence of non-linear effects, we show that our methodology outperforms existing EM-based nowcasting frameworks that use generalised linear models in the maximisation step. Finally, we apply the framework to the reporting of Argentinian Covid-19 cases, where the XGBoost-based approach again performs best.
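
To make the EM idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a covariate-free setting in which the count for occurrence period $t$ and delay $d$ is Poisson with mean $\lambda_t p_d$, so the maximisation step has a closed form; the paper instead fits machine learning models in that step. The function name `em_nowcast`, the toy intensities and the delay probabilities are all hypothetical.

```python
import numpy as np

def em_nowcast(observed, mask, n_iter=200):
    """Toy EM for nowcasting under reporting delay (illustrative assumption).

    observed: (T, D) array of counts by occurrence period t and delay d.
    mask:     (T, D) boolean array, True where the cell is observable today.
    Returns estimated occurrence intensities lam (T,) and delay probabilities p (D,).
    """
    T, D = observed.shape
    p = np.full(D, 1.0 / D)                             # start from uniform delays
    lam = np.where(mask, observed, 0.0).sum(axis=1) + 1.0
    for _ in range(n_iter):
        # E-step: replace unobservable cells by their expected count lam_t * p_d.
        expected = np.outer(lam, p)
        complete = np.where(mask, observed, expected)
        # M-step: closed-form Poisson / multinomial estimates on the completed table.
        lam = complete.sum(axis=1)
        p = complete.sum(axis=0) / complete.sum()
    return lam, p

# Hypothetical noise-free example: 4 occurrence periods, delays of 0, 1 or 2 periods.
lam_true = np.array([100.0, 200.0, 150.0, 120.0])
p_true = np.array([0.5, 0.3, 0.2])
full = np.outer(lam_true, p_true)

T, D = full.shape
# A cell is observable only if occurrence period + delay does not exceed the present.
mask = np.add.outer(np.arange(T), np.arange(D)) <= T - 1
observed = np.where(mask, full, 0.0)                    # hide not-yet-reported cells

lam_hat, p_hat = em_nowcast(observed, mask)
print("estimated intensities:", np.round(lam_hat, 3))
print("estimated delay probabilities:", np.round(p_hat, 3))
```

In this sketch the only information passed between iterations is the pair of estimates `lam` and `p`. Swapping the closed-form update for a fitted regression model on the completed counts is where a GLM, neural network or XGBoost model would enter.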

Paper Structure

This paper contains 44 sections, 36 equations, 28 figures, 6 tables, 3 algorithms.

Figures (28)

  • Figure 1: Occurrence and reporting timing of four events that happened in the same past occurrence period (highlighted in grey). Event occurrence is indicated with a black dot, event reporting with an x-mark. The reporting delay is visualised by the black line between a dot and an x-mark. A blue or red x-mark on the time axis indicates whether the occurrence of an event is observable or unobservable at present time, respectively.
  • Figure 2: Occurrence and reporting timing of six events that happened in two different occurrence periods for two different entities, denoted $A$ and $B$. The two occurrence periods are highlighted in grey. A blue or red x-mark indicates whether an event is observable or unobservable at present time $\tau$, respectively.
  • Figure 3: Visualisation of the consecutive steps for the model-agnostic expectation-maximisation framework. No information is passed between EM iterations except for the estimates for the occurrence intensities and the reporting probabilities.
  • Figure 4: Visualisation of the consecutive steps in the expectation-maximisation algorithm for the occurrence model when using a neural network with weight initialisation in the maximisation step. The transfer of the network weights is indicated with a curved dashed line. The network structure used within each EM iteration corresponds to Figure \ref{fig:NNocc} in Appendix \ref{append:modelocc}.
  • Figure 5: Visualisation of the consecutive steps in the expectation-maximisation algorithm for the occurrence model when using an additive XGBoost model in the maximisation step. Regression trees are visualised with a generic tree structure. The '$+$' symbols indicate that regression trees are additive within an EM iteration as well as between EM iterations.
  • ...and 23 more figures