A causal viewpoint on prediction model performance under changes in case-mix: discrimination and calibration respond differently for prognosis and diagnosis predictions

Wouter A. C. van Amsterdam

TL;DR

The paper presents a causal framework linking case-mix shifts to predictive-performance changes, showing that discrimination and calibration respond differently depending on whether prognosis (causal direction) or diagnosis (anti-causal direction) is being predicted. By defining case-mix shifts as changes in the marginal distribution of the cause variable and analyzing $P(Y|X)$ versus $P(X|Y)$, it proves that calibration is stable under $P(X)$ shifts for causal predictions while discrimination is not, and the reverse for anti-causal predictions. The authors validate the theory with illustrative simulations and with an empirical review of 1,382 models across 2,030 external validations, finding higher variability in discrimination for prognostic models, in line with the framework. These insights inform model development, evaluation, and deployment across clinical settings, emphasizing alignment of features with causal direction and cautious recalibration when calibration matters across environments.

Abstract

Prediction models need reliable predictive performance as they inform clinical decisions, aiding in diagnosis, prognosis, and treatment planning. The predictive performance of these models is typically assessed through discrimination and calibration. Changes in the distribution of the data impact model performance and there may be important changes between a model's current application and when and where its performance was last evaluated. In health-care, a typical change is a shift in case-mix. For example, for cardiovascular risk management, a general practitioner sees a different mix of patients than a specialist in a tertiary hospital. This work introduces a novel framework that differentiates the effects of case-mix shifts on discrimination and calibration based on the causal direction of the prediction task. When prediction is in the causal direction (often the case for prognosis predictions), calibration remains stable under case-mix shifts, while discrimination does not. Conversely, when predicting in the anti-causal direction (often with diagnosis predictions), discrimination remains stable, but calibration does not. A simulation study and empirical validation using cardiovascular disease prediction models demonstrate the implications of this framework. The causal case-mix framework provides insights for developing, evaluating and deploying prediction models across different clinical settings, emphasizing the importance of understanding the causal structure of the prediction task.
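The contrast described above can be illustrated with a small simulation. This is a minimal sketch under assumed settings, not the paper's simulation study: the logistic and Gaussian data-generating mechanisms, the sample size, the prevalences, and the helper names (`causal_env`, `anticausal_env`, `binned_calibration_error`) are all hypothetical choices made for illustration. The binned calibration error is a rough stand-in for a smoothed calibration metric such as the ICI.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000          # samples per environment (assumed)
TRAIN_PREV = 0.1     # outcome prevalence in the anti-causal training environment (assumed)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

def auc(pred, y):
    """Rank-based AUC (Mann-Whitney); assumes continuous predictions without ties."""
    order = np.argsort(pred)
    ranks = np.empty(len(pred))
    ranks[order] = np.arange(1, len(pred) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def binned_calibration_error(pred, y, n_bins=10):
    """Weighted mean |observed rate - mean prediction| over equal-size prediction bins."""
    bins = np.array_split(np.argsort(pred), n_bins)
    return sum(len(b) / len(pred) * abs(y[b].mean() - pred[b].mean()) for b in bins)

def causal_env(x_sd):
    """X -> Y: P(Y|X) is fixed across environments, only the spread of X changes."""
    x = rng.normal(0.0, x_sd, N)
    y = rng.binomial(1, sigmoid(x))
    return sigmoid(x), y  # the model equals the true P(Y=1|X)

def anticausal_env(prevalence):
    """Y -> X: P(X|Y) is fixed across environments, only the prevalence of Y changes."""
    y = rng.binomial(1, prevalence, N)
    x = rng.normal(y.astype(float), 1.0)  # X | Y=y ~ N(y, 1)
    # Bayes posterior that is correct at the *training* prevalence:
    # log-likelihood ratio of N(1,1) versus N(0,1) at x is x - 0.5.
    pred = sigmoid(logit(TRAIN_PREV) + x - 0.5)
    return pred, y

data = {
    "causal": {"train": causal_env(1.0), "test": causal_env(2.0)},
    "anti-causal": {"train": anticausal_env(TRAIN_PREV), "test": anticausal_env(0.4)},
}
metrics = {
    task: {env: (auc(p, y), binned_calibration_error(p, y)) for env, (p, y) in envs.items()}
    for task, envs in data.items()
}

for task, envs in metrics.items():
    for env, (a, c) in envs.items():
        print(f"{task:12s} {env:5s} AUC={a:.3f} calibration error={c:.3f}")
```

In this setup the causal model keeps a near-zero calibration error while its AUC changes with the spread of $X$, and the anti-causal model keeps its AUC while its calibration error grows once the prevalence of $Y$ departs from the training value.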

Paper Structure

This paper contains 16 sections, 2 theorems, 8 equations, 8 figures, 3 tables.

Key Result

Theorem 1

Let $Y$ be binary, let $f: \mathcal{X} \to [0,1]$ be a prediction model, and let $E \in \{\text{train},\text{test}\}$ denote the environment. Assume $X$ takes values in a measurable space $\mathcal{X}$ with measures $\phi_{train}(x), \phi_{test}(x)$ on the training and testing environments, and assume $\phi_{test}(x) \ll$ [...] Then the integrated calibration index (ICI) (Austin and Steyerberg, 2019) on distribution $P$ [...]

Figures (8)

  • Figure 1: Overview of main results. Depending on the causal direction of the prediction, a shift in 'case-mix' may be defined as either a shift in the marginal distribution of the features $X$ for causal prediction (causal DAG panel) or a shift in the marginal distribution of the outcome $Y$ for anti-causal prediction (anti-causal DAG panel). With these definitions, for models predicting in the causal direction, calibration remains constant under case-mix shifts between the training data and the testing data, but discrimination does not (causal results panel). For models predicting in the anti-causal direction, the reverse is true (anti-causal results panel). The calibration facets are calibration curves with the predicted probability on the horizontal axis and the actual probability on the vertical axis. The discrimination facets are receiver operating characteristic (ROC) curves with 1 minus specificity on the horizontal axis and sensitivity on the vertical axis. DAG: directed acyclic graph.
  • Figure 2: Directed acyclic graphs for two-variable prediction problems with a shift in case-mix, meaning the environment variable affects only the marginal distribution of the cause variable. The prediction is always made from feature $X$ to outcome $Y$; $E$ denotes the environment.
  • Figure 3: Overview of the illustrative simulation experiment: a model is trained on data from a screening environment and evaluated on the screening environment ('internal validation'), the general practitioner (GP) environment, or the hospital environment ('external validation'), with outcome probabilities increasing across these environments. For models predicting in the anti-causal direction (e.g. diagnostic models), a shift in case-mix entails a shift in the distribution of the outcome, so discrimination remains the same but calibration changes. For models predicting in the causal direction (e.g. prognostic models), a shift in case-mix entails a shift in the distribution of the features, so calibration remains the same but discrimination changes. The discrimination facets are receiver operating characteristic (ROC) curves with 1 minus specificity on the horizontal axis and sensitivity on the vertical axis. The calibration facets are calibration curves with the predicted probability on the horizontal axis and the actual probability on the vertical axis.
  • Figure 4: Combined results of the simulation experiment. Each model is connected by a line.
  • Figure 5: Example setting of diagnosing a sexually transmitted disease (STD, $=Y$) with a blood test ($=X$) in either a general public setting (first DAG) or an HIV clinic (second DAG). Patients with previous STDs such as HIV ($Y_0$) have a higher risk of future STDs, summarized by the arrow from $Y_0$ to $Y$. $Y_0=1$ is a selection criterion for the HIV clinic, meaning that only patients with a prior STD are seen there. Treating $Y_0$ as unobserved (thus marginalizing it out) results in the second DAG.
  • ...and 3 more figures

Theorems & Definitions (7)

  • Definition 1: calibration
  • Definition 2: case mix
  • Remark 1: conditional independencies
  • Theorem 1: perfectly calibrated models remain perfectly calibrated under marginal shifts in $X$
  • Proof of Theorem 1
  • Theorem 2: discrimination is constant under marginal shifts in $Y$
  • Proof of Theorem 2
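
The intuition behind the two theorems listed above can be sketched in a few lines. This is an informal outline, not the paper's proofs. It assumes that perfect calibration means $f(x) = P(Y=1 \mid X=x)$ almost everywhere, that the test measure is absolutely continuous with respect to the training measure, and that $X$ has class-conditional densities $p(x \mid Y=y)$; the symbols $X_1$, $X_0$ and $P_E$ are introduced here for illustration only.

```latex
% Causal direction: P(Y | X) is shared across environments, only P_E(X) changes.
\begin{align*}
P_E(X, Y) &= P(Y \mid X)\, P_E(X), \\
f(x) = P(Y=1 \mid X=x) \ \ \phi_{train}\text{-a.e.}
  \ \text{and}\ \phi_{test} \ll \phi_{train}
  &\;\Longrightarrow\;
  \int \bigl| f(x) - P(Y=1 \mid X=x) \bigr| \, d\phi_{test}(x) = 0 .
\end{align*}

% Anti-causal direction: P(X | Y) is shared across environments, only P_E(Y) changes.
\begin{align*}
P_E(X, Y) &= P(X \mid Y)\, P_E(Y), \\
\mathrm{AUC} &= P\bigl( f(X_1) > f(X_0) \bigr),
  \qquad X_1 \sim P(X \mid Y=1),\ X_0 \sim P(X \mid Y=0)
  \ \text{independent}, \\
\operatorname{logit} P_E(Y=1 \mid X=x)
  &= \log \frac{p(x \mid Y=1)}{p(x \mid Y=0)} + \operatorname{logit} P_E(Y=1).
\end{align*}
```

The AUC expression involves only $P(X \mid Y)$, so it does not depend on the environment, whereas the last line shows that the correct posterior moves with the prevalence $P_E(Y=1)$, which is why a fixed model $f$ loses calibration under a shift in $P(Y)$.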