Designing monitoring strategies for deployed machine learning algorithms: navigating performativity through a causal lens

Jean Feng; Adarsh Subbaswamy; Alexej Gossmann; Harvineet Singh; Berkman Sahiner; Mi-Ok Kim; Gene Pennello; Nicholas Petrick; Romain Pirracchio; Fan Xia

TL;DR

The paper addresses post-deployment monitoring under performativity, where ML predictions influence the outcomes they are meant to predict. It adopts a causal framework to compare six sequential monitoring procedures, formed by pairing three monitoring criteria (average PPV/NPV, subgroup-specific PPV/NPV, and over-confident risk predictions) with two data sources (observational and interventional). Through a readmission-risk case study and simulation studies, it shows that how a monitoring procedure behaves depends not only on the chosen metric but also on the identifiability assumptions and the data source. It makes a strong case for criterion 3 (calibration-focused) and often favors observational data (3O) for practical deployment. The work provides a systematic guide for designing, evaluating, and documenting ML monitoring systems in the presence of performativity, with code and open questions to spur further research.

Abstract

After a machine learning (ML)-based system is deployed, monitoring its performance is important to ensure the safety and effectiveness of the algorithm over time. When an ML algorithm interacts with its environment, the algorithm can affect the data-generating mechanism and be a major source of bias when evaluating its standalone performance, an issue known as performativity. Although prior work has shown how to validate models in the presence of performativity using causal inference techniques, there has been little work on how to monitor models in the presence of performativity. Unlike the setting of model validation, there is much less agreement on which performance metrics to monitor. Different monitoring criteria impact how interpretable the resulting test statistic is, what assumptions are needed for identifiability, and the speed of detection. When this choice is further coupled with the decision to use observational versus interventional data, ML deployment teams are faced with a multitude of monitoring options. The aim of this work is to highlight the relatively under-appreciated complexity of designing a monitoring strategy and how causal reasoning can provide a systematic framework for choosing between these options. As a motivating example, we consider an ML-based risk prediction algorithm for predicting unplanned readmissions. Bringing together tools from causal inference and statistical process control, we consider six monitoring procedures (three candidate monitoring criteria and two data sources) and investigate their operating characteristics in simulation studies. Results from this case study emphasize the seemingly simple (and obvious) fact that not all monitoring systems are created equal, which has real-world impacts on the design and documentation of ML monitoring systems.
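To make the performativity bias described in the abstract concrete, the following is a minimal sketch under toy assumptions that are not taken from the paper: each patient has a known untreated risk, the algorithm flags patients at or above a threshold, clinicians treat flagged patients with a fixed probability, and treatment reduces risk by a fixed relative amount. The function name `ppv_under_performativity` and the parameters `treat_prob` and `effect` are hypothetical. The sketch compares the expected PPV that would be seen without any treatment to the expected PPV computed from outcomes observed after treatment.

```python
def ppv_under_performativity(risks, threshold=0.5, treat_prob=0.8, effect=0.5):
    """Return (standalone_ppv, observed_ppv) in expectation for flagged patients.

    Toy assumptions (hypothetical, not the paper's model):
      - risks[i] is the probability of the outcome if patient i is NOT treated
      - patients with risk >= threshold are flagged by the algorithm
      - flagged patients are treated with probability treat_prob
      - treatment multiplies a patient's risk by (1 - effect)
    """
    flagged = [r for r in risks if r >= threshold]
    if not flagged:
        raise ValueError("no patients flagged at this threshold")

    # Expected PPV if nobody acted on the prediction (untreated outcomes).
    standalone = sum(flagged) / len(flagged)

    # Expected PPV from outcomes actually observed after clinicians react.
    shrink = 1 - treat_prob * effect
    observed = sum(r * shrink for r in flagged) / len(flagged)
    return standalone, observed


risks = [0.1, 0.3, 0.6, 0.8]
standalone, observed = ppv_under_performativity(risks)
print(f"standalone PPV: {standalone:.2f}, observed PPV: {observed:.2f}")
```

In this toy setting the observed PPV is lower than the standalone PPV even though the algorithm itself has not changed, which is why a naive monitoring procedure can raise alarms that reflect treatment effects rather than model decay.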

Paper Structure (19 sections, 16 equations, 5 figures, 2 tables)

Introduction
Related Work
Background: sequential monitoring
The case study: ML-based risk prediction algorithms
Candidate monitoring criteria
Data sources and causal models
Candidate monitoring strategies
Monitoring the average PPV/NPV (Criterion 1)
Option 1N: A naïve monitoring procedure
Option 1I (Interventional)
Option 1O (Observational)
Monitoring subgroup-specific PPV/NPVs (Criterion 2)
Checking for over-confident risk predictions (Criterion 3)
Comparing candidate strategies: a simulation study
Discussion
...and 4 more sections

Figures (5)

Figure 1: Causal model describing interfering medical interventions induced by an evolving ML-based risk prediction algorithm $\hat{f}_t$. $X_t$ denotes variables for the patient being queried for at time $t$. $Z_t$ denotes non-patient variables that may affect treatment decisions, such as past performance of the ML algorithm. $A_t$ denotes treatment assignment and $Y_t$ denotes the patient's outcome. Note that $\hat{f}_t(X_t) = (\hat{f}_t(X_t,0), \hat{f}_t(X_t,1))$.
Figure 2: Example control charts, which plot the chart statistic (solid line) and control limit (dashed line) over time. When the chart statistic exceeds the control limit, an alarm is fired.
Figure 3: Statistical power of different procedures, as characterized by the proportion of alarms fired at each time point. Dashed vertical line is the time of the distribution shift.
Figure 4: Assessing Type I error for monitoring procedures
Figure 5: Statistical power of different procedures, as characterized by the proportion of alarms fired at each time point. The simulated shift in the conditional distribution of the outcome is gradual over time, with the start time of the shift indicated by the dashed vertical line.
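The control chart mechanics described in the Figure 2 caption can be illustrated with a generic one-sided CUSUM chart. This is a sketch of a standard statistical process control technique, not the paper's monitoring procedures; the function name `cusum_alarm_time`, the `reference` value, the constant `control_limit`, and the example stream are all assumptions chosen for illustration.

```python
def cusum_alarm_time(observations, reference, control_limit):
    """Run a one-sided CUSUM chart and return (alarm_time, chart_statistics).

    The chart statistic accumulates how far each observation exceeds the
    reference value and is reset to zero whenever it would go negative.
    An alarm fires at the first time the statistic exceeds the control limit.
    alarm_time is None if no alarm fires.
    """
    stat = 0.0
    stats = []
    for t, x in enumerate(observations):
        stat = max(0.0, stat + x - reference)
        stats.append(stat)
        if stat > control_limit:
            return t, stats
    return None, stats


# Hypothetical stream of binary error indicators: errors are rare for the
# first 8 time points and become frequent afterwards.
stream = [0, 0, 1, 0, 0, 0, 1, 0] + [1, 1, 0, 1, 1, 1, 1, 1]
alarm_time, chart = cusum_alarm_time(stream, reference=0.5, control_limit=2.0)
print("alarm fired at time index:", alarm_time)
```

Raising the control limit lowers the false alarm rate at the cost of slower detection, which is the trade-off that the Type I error and power figures summarize for the procedures studied in the paper.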
