Beyond forecast leaderboards: Measuring individual model importance based on contribution to ensemble accuracy
Minsu Kim, Evan L. Ray, Nicholas G. Reich
TL;DR
This paper introduces two metrics, leave-one-model-out (LOMO) and leave-all-subsets-of-models-out (LASOMO), the latter based on the Shapley value, to quantify how much each individual model contributes to an ensemble's predictive accuracy in probabilistic forecasting, going beyond the standalone accuracy rankings of forecast leaderboards. By decomposing a model's importance into components tied to its own accuracy and to the similarity of its errors to those of the other models, the authors give a principled account of when a model adds value to an ensemble and when it degrades it. The methods are examined analytically, through simulations, and with real-world probabilistic forecasts of US COVID-19 deaths from the Forecast Hub, including a Massachusetts case study that highlights the importance of diversity and counterbalancing biases. The work offers a practical framework for diagnosing ensemble composition, with the potential to incentivize diverse, complementary forecasting approaches in decision-relevant settings.
Abstract
Ensemble forecasts often outperform forecasts from individual standalone models, and have been used to support decision-making and policy planning in various fields. As collaborative forecasting efforts to create effective ensembles grow, so does interest in understanding individual models' relative importance in the ensemble. To this end, we propose two practical methods that measure the difference between ensemble performance when a given model is or is not included in the ensemble: a leave-one-model-out algorithm and a leave-all-subsets-of-models-out algorithm, which is based on the Shapley value. We explore the relationship between these metrics, forecast accuracy, and the similarity of errors, both analytically and through simulations. We illustrate this measure of the value a component model adds to an ensemble in the presence of other models using US COVID-19 death probabilistic forecasts. This study offers valuable insight into individual models' unique features within an ensemble, which standard accuracy metrics alone cannot reveal.
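The two measures described in the abstract can be illustrated with a minimal Python sketch. It rests on simplifying assumptions that do not come from the paper: point forecasts instead of probabilistic ones, an equally weighted mean ensemble, squared error as the accuracy score, and a single forecast target. The function names and the toy numbers are hypothetical. The handling of the empty subset (skipped, with Shapley-style weights renormalized over the remaining subset sizes) is likewise an assumption made so that every ensemble has at least one member.

```python
from itertools import combinations
from math import comb


def ensemble_error(forecasts, members, truth):
    """Squared error of the equally weighted mean of the members' forecasts."""
    mean = sum(forecasts[m] for m in members) / len(members)
    return (mean - truth) ** 2


def lomo_importance(forecasts, truth):
    """Leave-one-model-out: error without model i minus error of the full
    ensemble. Requires at least two models."""
    models = list(forecasts)
    full = ensemble_error(forecasts, models, truth)
    return {
        i: ensemble_error(forecasts, [m for m in models if m != i], truth) - full
        for i in models
    }


def lasomo_importance(forecasts, truth):
    """Shapley-style weighted average, over subsets of the other models, of the
    error reduction obtained by adding model i. Requires at least two models."""
    models = list(forecasts)
    n = len(models)
    scores = {}
    for i in models:
        others = [m for m in models if m != i]
        total = 0.0
        # Assumption: the empty subset is skipped because an ensemble needs at
        # least one member; weights are renormalized over sizes 1..n-1 so that
        # each subset size counts equally.
        for size in range(1, n):
            weight = 1.0 / ((n - 1) * comb(n - 1, size))
            for subset in combinations(others, size):
                without_i = ensemble_error(forecasts, subset, truth)
                with_i = ensemble_error(forecasts, subset + (i,), truth)
                total += weight * (without_i - with_i)
        scores[i] = total
    return scores


# Made-up point forecasts from three hypothetical models; the observed value is 10.
forecasts = {"A": 8.0, "B": 12.0, "C": 13.0}
truth = 10.0
print(lomo_importance(forecasts, truth))    # {'A': 5.25, 'B': -0.75, 'C': -1.0}
print(lasomo_importance(forecasts, truth))  # {'A': 5.8125, 'B': 1.3125, 'C': -0.125}
```

In this toy example, models A and B have the same standalone squared error (4), yet A scores far higher on both measures because its under-prediction offsets the over-prediction of B and C. A positive score means the ensemble is more accurate with the model than without it; a negative score means the model makes the ensemble worse.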
