A Kalman Filter Based Framework for Monitoring the Performance of In-Hospital Mortality Prediction Models Over Time

Jiacheng Liu, Lisa Kirkland, Jaideep Srivastava

TL;DR

The paper tackles the problem of comparing binary classifier performance over time when real-world data batches vary in size and class distribution, which makes metrics like AUCROC hard to compare across periods. It introduces a Kalman-filter-based framework that estimates a time-varying mean performance $\theta_t$ and its variance by incorporating the current batch statistics $m_t$ and $n_t$ and extrapolating the uncertainty to the next window. The filter takes the sample AUCROC $z_t$ as its observation and the variance $r_t$ as its observation noise, with a conservative upper bound on the variance when positives are scarce. The method is demonstrated on synthetic data and on a retrospective 2-day-ahead in-hospital mortality model for COVID-19 (2021–2022), showing that filtered performance remains stable despite changes in disease variants, treatments, and hospital operations, and highlighting the framework's potential generalization to other metrics such as AUCPR. The work provides a practical approach for real-world monitoring of predictive performance, enabling fair cross-period comparisons and more reliable deployment decisions in healthcare settings.

Abstract

Unlike in a clinical trial, where researchers get to determine the minimum number of positive and negative samples required, or in a machine learning study, where the size and class distribution of the validation set are static and known, in a real-world scenario there is little control over the size and distribution of incoming patients. As a result, when measured during different time periods, evaluation metrics like the Area Under the Receiver Operating Characteristic Curve (AUCROC) and the Area Under the Precision-Recall Curve (AUCPR) may not be directly comparable. Therefore, in this study, for binary classifiers running over a long period of time, we propose to adjust these performance metrics for sample size and class distribution, so that a fair comparison can be made between two time periods. Note that the number of samples and the class distribution, namely the ratio of positive samples, are two robustness factors that affect the variance of AUCROC. To better estimate the mean of performance metrics and understand the change of performance over time, we propose a Kalman-filter-based framework with extrapolated variance adjusted for the total number of samples and the number of positive samples during different time periods. The efficacy of this method is demonstrated first on a synthetic dataset, and the method is then applied retrospectively to a 2-day-ahead in-hospital mortality prediction model for COVID-19 patients during 2021 and 2022. Further, we conclude that our prediction model is not significantly affected by the evolution of the disease, improved treatments, and changes in hospital operational plans.
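To make the idea concrete, here is a minimal sketch of a scalar Kalman filter that smooths per-batch AUCROC values, weighting each batch by a variance that depends on its size and class balance. Several pieces are assumptions and not taken from the paper: the random-walk state model, the process noise `q`, the use of the Hanley and McNeil (1982) approximation for the measurement variance, and all batch values. The paper's own extrapolated variance and its conservative upper bound for scarce positives are not reproduced here.

```python
from dataclasses import dataclass
from typing import Tuple


def hanley_mcneil_variance(auc: float, n_pos: int, n_neg: int) -> float:
    """Approximate sampling variance of an AUCROC estimate (Hanley & McNeil, 1982).

    Assumption: this stands in for the measurement variance r_t; the paper's
    exact variance formula is not given in this section.
    """
    if n_pos < 1 or n_neg < 1:
        raise ValueError("need at least one positive and one negative sample")
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc**2 / (1.0 + auc)
    numerator = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc**2)
        + (n_neg - 1) * (q2 - auc**2)
    )
    return numerator / (n_pos * n_neg)


@dataclass
class ScalarKalman:
    theta: float     # current estimate of the mean performance
    p: float         # variance of that estimate
    q: float = 1e-4  # process noise of the assumed random walk (hypothetical value)

    def step(self, z: float, r: float) -> Tuple[float, float]:
        """Fold in one observation z with measurement variance r."""
        p_pred = self.p + self.q          # predict: random walk keeps theta, inflates p
        gain = p_pred / (p_pred + r)      # Kalman gain: small when r is large
        self.theta = self.theta + gain * (z - self.theta)
        self.p = (1.0 - gain) * p_pred
        return self.theta, self.p


# Illustrative batches only: (sample AUCROC, number of positives, number of negatives).
batches = [(0.85, 250, 4750), (0.80, 20, 380), (0.95, 3, 47), (0.84, 100, 4900)]

kf = ScalarKalman(theta=0.85, p=0.01)
filtered = []
for z, n_pos, n_neg in batches:
    r = hanley_mcneil_variance(z, n_pos, n_neg)
    filtered.append(kf.step(z, r))

for (z, n_pos, n_neg), (theta, p) in zip(batches, filtered):
    print(f"raw={z:.3f}  n_pos={n_pos:4d}  filtered={theta:.3f}  var={p:.2e}")
```

In this sketch, the third batch has only three positives, so its measurement variance is large and the gain is small: the filtered estimate moves only part of the way toward the raw value of 0.95. That is the intended behaviour of weighting each period by how much evidence it carries.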

Paper Structure (16 sections, 7 equations, 6 figures, 4 tables)


Figures (6)

  • Figure 1: Causes of change in predictive model performance over time. The direct causes are the factors that affect the mean and variance of evaluation metrics. Note that all factors in this figure may change over time. In addition, the categorization of factor types is metric dependent: the number of ground-truth positive samples affects only the variance of AUCROC, but both the mean and the variance of the Area Under the Precision-Recall Curve.
  • Figure 2: Monthly AUCROC performance (the blue line) of the 2-day-ahead in-hospital mortality prediction model for COVID-19 patients. The model is trained on 2020 data only, then tested retrospectively on 2021 and 2022. The grey line shows a strong seasonal trend in the number of hospitalized patients per month. Generally, the scale of performance fluctuation, in other words the sample variance, tends to be large when the number of patients is low. This is no coincidence.
  • Figure 3: Proposed Kalman-filter-based framework for estimating model performance over time. Changes made to the classical Kalman filter are highlighted in the red boxes.
  • Figure 4: Results of a 3-phase simulation. Ground-truth AUCROC is in black, raw AUCROC in blue, and filtered AUCROC in red. Phase 1 (steps 0-19): the ground-truth AUCROC and the binary class distribution (5%) stay the same, while the total number of samples starts at 5000 and gradually decreases to 50. Phase 2 (steps 20-39): the ground-truth AUCROC is unchanged and the total number of samples remains 400, while the positive ratio gradually decreases to 2%. Phase 3 (steps 40-59): the ground-truth AUCROC declines, the positive ratio stays at 2% as in the previous phase, and the total number of samples gradually increases from 400 to 5000.
  • Figure 5: Raw monthly AUCROC values of the 2-day-ahead in-hospital mortality prediction model for COVID-19 patients, shown as the black solid line (the same series as in Fig. 2). Filtered AUCROC is in red. The 95% confidence intervals of the raw and filtered AUCROC are shown as dotted/dashed lines of the respective color.
  • ...and 1 more figures