Beyond forecast leaderboards: Measuring individual model importance based on contribution to ensemble accuracy

Minsu Kim, Evan L. Ray, Nicholas G. Reich

TL;DR

This paper introduces two metrics, leave-one-model-out (LOMO) and leave-all-subsets-of-models-out (LASOMO, which is based on the Shapley value), to quantify how much each individual model contributes to an ensemble's predictive accuracy in probabilistic forecasting, going beyond what standard accuracy metrics reveal. By decomposing model importance into components tied to each forecaster's own accuracy and the similarity of its errors to those of other forecasters, the authors provide a principled interpretation of when a model adds value to an ensemble and when it reduces ensemble quality. The methods are examined analytically, through simulations, and with real-world US COVID-19 death forecasts from the Forecast Hub, including a Massachusetts case study that highlights the importance of diversity and counterbalancing biases. The work offers a practical framework for diagnosing ensemble composition, with potential to incentivize diverse, complementary forecasting approaches in decision-relevant settings.

Abstract

Ensemble forecasts often outperform forecasts from individual standalone models, and have been used to support decision-making and policy planning in various fields. As collaborative forecasting efforts to create effective ensembles grow, so does interest in understanding individual models' relative importance in the ensemble. To this end, we propose two practical methods that measure the difference between ensemble performance when a given model is or is not included in the ensemble: a leave-one-model-out algorithm and a leave-all-subsets-of-models-out algorithm, which is based on the Shapley value. We explore the relationship between these metrics, forecast accuracy, and the similarity of errors, both analytically and through simulations. We illustrate this measure of the value a component model adds to an ensemble in the presence of other models using US COVID-19 death probabilistic forecasts. This study offers valuable insight into individual models' unique features within an ensemble, which standard accuracy metrics alone cannot reveal.
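The two algorithms named in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation, and it rests on several assumptions that do not come from the text above: forecasts are point predictions, the ensemble is an equally weighted mean, accuracy is the negative squared error, and the Shapley-style weights are renormalized over non-empty subsets because the score of an empty ensemble is undefined. The function names `ensemble_score`, `lomo_importance`, and `lasomo_importance` are hypothetical.

```python
from itertools import combinations
from math import factorial


def ensemble_score(forecasts, y):
    """Negative squared error of an equally weighted mean ensemble (assumed score)."""
    ens = sum(forecasts) / len(forecasts)
    return -((ens - y) ** 2)


def lomo_importance(forecasts, y):
    """Leave-one-model-out: score(all models) - score(all models except i)."""
    full = ensemble_score(forecasts, y)
    return [
        full - ensemble_score(forecasts[:i] + forecasts[i + 1:], y)
        for i in range(len(forecasts))
    ]


def lasomo_importance(forecasts, y):
    """Shapley-style average of the marginal contribution of model i over
    subsets of the other models. Assumption: the empty subset is skipped
    and the remaining weights are renormalized to sum to one."""
    n = len(forecasts)
    importances = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total, weight_sum = 0.0, 0.0
        for size in range(1, n):  # size 0 (empty ensemble) is skipped
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [forecasts[j] for j in subset]
                with_i = without_i + [forecasts[i]]
                total += w * (ensemble_score(with_i, y) - ensemble_score(without_i, y))
                weight_sum += w
        importances.append(total / weight_sum)
    return importances


# Toy example: two models biased low, one biased high, truth y = 0.
print(lomo_importance([-1.0, -0.5, 1.5], 0.0))  # [0.25, 0.0625, 0.5625]
print(lasomo_importance([-1.0, -0.5, 1.5], 0.0))
```

In this toy case the third model is the least accurate on its own (squared error 2.25) yet has the largest leave-one-model-out importance, because its positive error offsets the negative errors of the other two. The subset-based measure can rank the models differently, since it also averages over smaller ensembles in which that offsetting is incomplete.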

Paper Structure

This paper contains 22 sections, 14 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Distributional forecasts of COVID-19 incident deaths at 1- through 4-week horizons in Massachusetts made on November 27, 2021, by three models. Solid black dots show historical data available as of November 28. Blue dots indicate predictive medians, and the shaded bands represent 95% prediction intervals. The open black circles are observations not available when the forecast was made. The 95% prediction intervals of the UMass-MechBayes model (truncated here for better visibility of the observed data) extend up to 671 and 1110 for the 3-week and 4-week ahead horizons, respectively.
  • Figure 2: Expected importance of three forecasters as a function of the prediction/bias of forecaster 3 in simulation settings: (a) $\hat{y}_{1} = -1$, $\hat{y}_{2} = -0.5$, and $\hat{y}_{3} = b$ based on the negative SPE, (b) $F_{1,\tau} = N(-1, 1)$, $F_{2,\tau} = N(-0.5, 1)$, and $F_{3,\tau} = N(b, 1)$ based on the negative WIS, where $\tau=1,\dots,1000$. The data generating process is $N(0,1)$. The expected importance metrics were calculated and averaged over $1000$ replicates of the forecasting experiments conducted at each value of $b$, incremented by 0.05 from $-1$ to $3$.
  • Figure 3: Expected importance of three forecasters as a function of dispersion of forecaster 3 in the simulation setting: $F_{1,\tau} = N(0, 0.5^2)$, $F_{2,\tau} = N(0, 0.7^2)$, and $F_{3,\tau} = N(0, s^2)$ based on the negative WIS, where $\tau=1,\dots,1000$. The data generating process is $N(0,1)$. The expected importance metrics were calculated and averaged over $1000$ replicates of the forecasting experiments conducted at each value of $s$, incremented by 0.05 from $0.1$ to $3$.
  • Figure 4: Model importance versus negative WIS by model for all weeks in 2021. Each triangle represents a pair of negative WIS ($x$-axis, larger values indicate more accurate forecasts) and importance metric ($y$-axis, larger values indicate more important forecasts) for a week in 2021. Solid black circles represent negative WIS and importance metric pairs evaluated for the one week ending December 25, 2021 (see more details in Figure 5). The horizontal dashed lines indicate the value of zero. The importance of an individual model as an ensemble member tends to be positively correlated with the value of negative WIS; that is, the importance metric has a positive correlation with the model's prediction accuracy.
  • Figure 5: (a) Model importance of each model versus negative WIS in Massachusetts on target end date 2021-12-25. CovidAnalytics-DELPHI is the most important and also the least accurate by $-\text{WIS}$. (b) Predictive medians and 95% prediction intervals (PIs) of individual forecasts (top) and ensemble forecasts built leaving one model out (bottom) on target end date 2021-12-25. For example, the lines on the far left indicate the PI for the CovidAnalytics-DELPHI model on the top panel and the PI for the ensemble created without the CovidAnalytics-DELPHI model on the bottom panel. "None (ensemble of all)" represents an ensemble model built on all nine individual models. In each PI, the end points indicate the 0.025 and 0.975 quantiles and the mid-point represents the 0.5 quantile (predictive median). The horizontal dashed lines represent the eventual observation. The ensemble without CovidAnalytics-DELPHI is the only ensemble model with a point estimate below 150. The models on the $x$-axis are listed in order of model importance.
  • ...and 2 more figures
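
As a worked illustration of the point-forecast setting in the Figure 2(a) caption, the expected importance of forecaster 3 can be written in closed form under two assumptions that the captions do not state: the ensemble is the equally weighted mean of the point forecasts, and importance is the leave-one-model-out difference in expected negative SPE. With $y \sim N(0,1)$ and a fixed ensemble prediction $e$, where $e_{\text{all}}$ and $e_{-3}$ (notation introduced here) denote the ensemble with and without forecaster 3:

$$
\mathbb{E}\left[-(e - y)^2\right] = -(e^2 + 1), \qquad
e_{-3} = \frac{-1 - 0.5}{2} = -0.75, \qquad
e_{\text{all}} = \frac{b - 1.5}{3},
$$

$$
\text{importance}_3(b) = \mathbb{E}\left[-(e_{\text{all}} - y)^2\right] - \mathbb{E}\left[-(e_{-3} - y)^2\right]
= 0.75^2 - \left(\frac{b - 1.5}{3}\right)^2 .
$$

Under these assumptions the importance of forecaster 3 peaks at $b = 1.5$, where its positive bias exactly offsets the negative biases of forecasters 1 and 2, and within the plotted range $[-1, 3]$ it is negative only for $b < -0.75$.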