Assessing model error in counterfactual worlds

Emily Howerton, Justin Lessler

TL;DR

The paper addresses the lack of retrospective evaluation for counterfactual scenario projections by formalizing a framework that separates errors due to scenario deviation from model miscalibration. It outlines three evaluative approaches—restricting to plausible scenarios, inferring error distributions, and estimating observations in modeled scenarios—and tests them in a simulated epidemic setting to assess calibration accuracy. Results indicate that approaches leveraging modeled counterfactuals and causal inference (Approaches 2 and 3) more reliably recover calibration error than evaluating only plausible scenarios (Approach 1), with Approach 3 generally preferable when feasible. The work underscores the importance of careful scenario design and provides practical guidance for isolating calibration error to improve decision-making under uncertainty.

Abstract

Counterfactual scenario modeling exercises that ask "what would happen if?" are one of the most common ways we plan for the future. Despite their ubiquity in planning and decision making, scenario projections are rarely evaluated retrospectively. Differences between projections and observations come from two sources: scenario deviation and model miscalibration. We argue the latter is most important for assessing the value of models in decision making, but requires estimating model error in counterfactual worlds. Here we present and contrast three approaches for estimating this error, and demonstrate the benefits and limitations of each in a simulation experiment. We provide recommendations for the estimation of counterfactual error and discuss the components of scenario design that are required to make scenario projections evaluable.

Paper Structure

This paper contains 28 sections, 3 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Scenario projection schematic. (A) For low and high vaccination uptake scenarios, projections for final epidemic size, $(x_i, \mathcal{P}(y|x_i))$, are shown (black solid points). An observation consisting of a realized vaccination uptake and corresponding epidemic size, $(x^*, \mathcal{P}^*(y|x^*))$, is obtained after the projection period (red solid point). Some relationship between outcomes along the vaccination axis exists for the model (gray line) and the truth (red line), but these relationships likely will not be available in practice. A projection for the realized scenario, $\mathcal{P}(y|x^*)$, can be obtained retrospectively (open black point) if this relationship can be estimated or if the models can be rerun retrospectively. For each of the modeled scenarios, an observation exists conceptually, $\mathcal{P}^*(y|x_i)$, but will never be observed (open red points). We can obtain the calibration error in the realized scenario (green bar) by comparing the model projection for the realized scenario to the observation, $\mathcal{P}(y|x^*) - \mathcal{P}^*(y|x^*)$. However, we are ultimately interested in calibration error in the modeled scenarios (blue bars), $\mathcal{P}(y|x_i) - \mathcal{P}^*(y|x_i)$. (B) A second set of projections, with a different relationship along the scenario axis. Projections in (A) and (B) have equivalent error in the realized scenario, but different error in the modeled scenarios.
  • Figure 2: Overview of Approach 1. For three sample locations (panels) from a single model in the simulation experiment, the panels show projected values (black circle), observations (filled red circle), and what the observations would have been in the scenarios that were modeled (open red circle). The "true" relationship between vaccine uptake (scenario axis) and final epidemic size (projection axis) is shown with a light red line. The true error is shown with a light gray line. Implementing Approach 1 requires two main steps. First, the plausible scenario is identified for each location by comparing the vaccine uptake assumed in each scenario with the realized vaccine uptake (dotted lines, step 1). In locations 15 and 29, the low vaccination scenario was retrospectively deemed plausible, and in location 41, the high vaccination scenario was plausible. Then, for each location, error is calculated by comparing the observed final epidemic size to the projected epidemic size (arrow, step 2).
  • Figure 3: Overview of Approach 2. (A) For each location, error in the realized scenario is calculated by subtracting the observation (red circle) from the projection that would have been made (open black circle). Projected values for each scenario (black circles) and the projected relationship (gray line) are shown for reference. (B) Error in the realized scenario is plotted as a function of realized vaccine uptake for each location (open gray circles), and a model is fit to infer error in the modeled scenarios. The dark and light gray ribbons show the 50% and 90% prediction intervals from a generalized additive model fit with a spline term for realized vaccine uptake. Three locations from the simulation experiment are highlighted as examples.
  • Figure 4: Overview of Approach 3. (A) Observations for each location (red circles) are fit across the scenario axis using a generalized additive model with a spline term for realized vaccine uptake and location-specific $R_0$ as a covariate. From this model, location-specific estimates of observations in the modeled scenarios can be made (two locations shown as examples; the ribbon shows the 50% prediction interval and the line shows the median prediction). (B) Then, for each location, projected values (black circles) are compared to the inferred observations (arrows) to calculate error for each scenario that was modeled.
  • Figure 5: Example of error decomposition. Total error is defined as the difference between projection and observation (purple, $P^m(y|x_i) - P^*(y|x^*)$), and can be separated into two components: model calibration error (blue, $P^m(y|x_i) - P^*(y|x_i)$) and scenario specification error (pink, $P^*(y|x_i)-P^*(y|x^*)$).
  • ...and 6 more figures
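
The error decomposition described in the Figure 5 caption can be illustrated with a minimal Python sketch. Everything numeric here is an assumption made for illustration: the functions `true_size` and `model_size`, the uptake values, and the helper `decompose` are hypothetical and are not taken from the paper's simulation experiment. In particular, the counterfactual observation $P^*(y|x_i)$ is computable only because the toy "truth" is known; in practice it is never observed and would have to be estimated, for example with Approach 2 or 3.

```python
import math


def true_size(uptake):
    """Hypothetical 'true' final epidemic size as a function of vaccine uptake."""
    return 1000.0 * (1.0 - 0.6 * uptake)


def model_size(uptake):
    """Hypothetical model projection of final epidemic size."""
    return 900.0 * (1.0 - 0.4 * uptake)


def decompose(projection, obs_realized, obs_in_scenario):
    """Split total error into calibration and scenario specification error.

    projection      : P^m(y | x_i), the projection for a modeled scenario
    obs_realized    : P^*(y | x^*), the observation in the realized scenario
    obs_in_scenario : P^*(y | x_i), the (normally unobservable) observation
                      that would have occurred in the modeled scenario
    """
    total = projection - obs_realized
    calibration = projection - obs_in_scenario
    scenario = obs_in_scenario - obs_realized
    return total, calibration, scenario


# Assumed scenario definitions (vaccine uptake) and an assumed realized uptake.
scenarios = {"low": 0.3, "high": 0.7}
realized_uptake = 0.4
obs_realized = true_size(realized_uptake)

results = {}
for name, x_i in scenarios.items():
    results[name] = decompose(model_size(x_i), obs_realized, true_size(x_i))
    total, calibration, scenario = results[name]
    print(f"{name}: total={total:.1f} "
          f"calibration={calibration:.1f} scenario={scenario:.1f}")
```

In this toy setup the two components always sum to the total error, because the $P^*(y|x_i)$ terms cancel. The low-uptake scenario also shows why total error alone can mislead: its scenario specification error and calibration error have opposite signs and partly offset each other.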