
Monitoring the calibration of probability forecasts with an application to concept drift detection involving image classification

Christopher T. Franck, Anne R. Driscoll, Zoe Szajnfarber, William H. Woodall

TL;DR

The paper addresses the challenge of prospectively monitoring the calibration of probability forecasts in image classification under potential concept drift. It introduces a calibration CUSUM chart with dynamic probability control limits (DPCLs) and a linear-log-odds (LLO) recalibration framework to detect when predictions cease to be well calibrated, using only predicted probabilities $x$ and binary outcomes $y$. Validation includes Monte Carlo simulations and a CIFAR-10 case study showing that the method maintains the target conditional false alarm rate (CFAR) while signaling miscalibration quickly when drift occurs, including drift caused by the emergence of new subtypes. The approach is model-agnostic and applies to any sequential prediction setting where calibration over time is critical, without requiring internal access to the predictive model.
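
As a rough illustration of the LLO recalibration mentioned above, the sketch below assumes the commonly used linear-log-odds form, in which the recalibrated log odds are a linear function of the original log odds: $\text{logit}(p) = \log\delta + \gamma\,\text{logit}(x)$. The function name `llo` and its argument names are chosen here for illustration and are not taken from the paper.

```python
import math


def llo(x: float, delta: float, gamma: float) -> float:
    """Linear-log-odds transform (assumed form, hypothetical name).

    Maps a raw prediction x in (0, 1) to a recalibrated probability p via
    logit(p) = log(delta) + gamma * logit(x).
    delta shifts predictions on the log-odds scale; gamma rescales them.
    """
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    if delta <= 0.0:
        raise ValueError("delta must be positive")
    logit_x = math.log(x) - math.log1p(-x)
    logit_p = math.log(delta) + gamma * logit_x
    return 1.0 / (1.0 + math.exp(-logit_p))
```

Under this form, `delta = 1` and `gamma = 1` leave a prediction unchanged, while a `gamma` below 1 pulls predictions away from the extremes of zero and one.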

Abstract

Machine learning approaches for image classification have led to impressive advances in that field. For example, convolutional neural networks are able to achieve remarkable image classification accuracy across a wide range of applications in industry, defense, and other areas. While these machine learning models boast impressive accuracy, a related concern is how to assess and maintain calibration in the predictions these models make. A classification model is said to be well calibrated if its predicted probabilities correspond with the rates at which events actually occur. While there are many available methods to assess machine learning calibration and recalibrate faulty predictions, less effort has been spent on developing approaches that continually monitor predictive models for potential loss of calibration as time passes. We propose a cumulative sum-based approach with dynamic limits that enable detection of miscalibration in both traditional process monitoring and concept drift applications. This enables early detection of operational context changes that impact image classification performance in the field. The proposed chart can be used broadly in any situation where the user needs to monitor probability predictions over time for potential lapses in calibration. Importantly, our method operates on probability predictions and event outcomes and does not require under-the-hood access to the machine learning model.

Paper Structure

This paper contains 8 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Schematic of the proposed calibration CUSUM chart. The horizontal axis is time, and the vertical axis corresponds to values of the CUSUM statistic. The solid black line corresponds to the observed value of the CUSUM statistic, and the dotted blue line provides the control limits. For the purpose of monitoring calibration of probability predictions, low values of the statistic correspond to the hypothesis that the model predictions are well calibrated while high values of the statistic correspond to a specific hypothesis that the predictions are not well calibrated. The left panel illustrates a scenario where calibration is maintained across the observed time course, as the CUSUM statistic does not exceed the control limits. The right panel includes a loss of calibration (vertical red line), and thus the CUSUM statistic (black) exceeds the control limits (blue) shortly thereafter, indicating to the user that probability predictions from the model are no longer well calibrated.
  • Figure 2: Illustration of calibration analysis using CNN confidence scores $x$ and LLO recalibrated predictions $p$. The event being predicted is whether an image contains a vehicle. The left panel shows the recalibration curve. The solid black line shows the transformation needed for badly calibrated CNN confidence scores $x$ to be translated to well calibrated predictions $p$. The $x=y$ line would arise if the original $x$ predictions were well calibrated initially. The right panel is a line plot where each line corresponds to a prediction from a single image. Blue lines are predictions for images that actually contain a vehicle, and red lines are for images that contain animals. This panel shows that the initial $x$ predictions were overconfident, and the recalibration function moves them further away from the extremes of zero and one. Maximum likelihood analysis was performed using the 10,000 predictions in the CIFAR-10 test set. MLEs for these data are $\hat{\delta}=0.89$ and $\hat{\gamma}=0.17$. The right panel line plot shows a smaller subset of the predictions for visual clarity.
  • Figure 3: Flowchart for how the calibration monitoring scheme works. The model is trained and initial calibration is verified, then monitoring begins. At each time point $t$, prediction $p_t$ and event data $y_t$ are collected, and a CUSUM statistic $S_t$ is computed along with control limits. Monitoring continues until the chart signals. Full details of the calibration CUSUM monitoring approach are in Section \ref{sec:CUSUMmethod}.
  • Figure 4: Calibration CUSUM charts for the CIFAR-10 vehicle classification monitoring problem using $\delta_a=1$ and $\gamma_a =1/2$. Top left panel shows CNN confidence scores $x$ from a neural net using all training data but with no recalibration. The CUSUM statistic increases rapidly as raw CNN confidence scores are not calibrated. Bottom left shows CNN recalibrated probabilities $p$. The blue line shows the DPCLs generated with $\alpha=10^{-5}$ and $Q=100,000$. Predictions remain calibrated across the time course. Top right shows CNN confidence scores $x$ using a CNN that had no fliers (birds or planes) in the training set and was not recalibrated. Bottom right uses a CNN and recalibration that both omit birds and planes. The first 8000 time points involve non-flying objects and the chart does not signal. The final 2000 time points involve the fliers (birds and planes). The chart signals after exposure to 47 images of fliers.
  • Figure 5: Calibration CUSUM charts for the CIFAR-10 analysis considering four separate alternatives. The "shift down" chart (top left) uses $\delta_a=1/2$, $\gamma_a=1$. The "shift up" chart (top right) uses $\delta_a=2$, $\gamma_a=1$. The "scale down" chart (bottom left) uses $\delta_a=1$, $\gamma_a=1/2$. The "scale up" chart (bottom right) uses $\delta_a=1$, $\gamma_a=2$.
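
The monitoring loop described in the Figure 3 caption, together with the alternative parameters $\delta_a$ and $\gamma_a$ in Figures 4 and 5, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the standard one-sided log-likelihood-ratio CUSUM for Bernoulli outcomes, in which the alternative hypothesis is that the true event probability is the LLO transform of $p_t$ with parameters $(\delta_a, \gamma_a)$, and it assumes the control limits are obtained by simulating in-control paths and taking a $(1-\alpha)$ quantile at each time point. All function names and the small default values of `alpha` and `q` are hypothetical choices for speed; Figure 4 reports $\alpha=10^{-5}$ and $Q=100,000$.

```python
import math
import random


def llo(p, delta, gamma):
    """Assumed LLO form: logit(out) = log(delta) + gamma * logit(p)."""
    z = math.log(delta) + gamma * (math.log(p) - math.log1p(-p))
    return 1.0 / (1.0 + math.exp(-z))


def llr_weight(p, y, delta_a, gamma_a):
    """Log-likelihood ratio of outcome y: LLO alternative vs. calibrated p."""
    p_a = llo(p, delta_a, gamma_a)
    return math.log(p_a / p) if y == 1 else math.log((1.0 - p_a) / (1.0 - p))


def cusum_path(probs, outcomes, delta_a, gamma_a):
    """One-sided CUSUM: S_t = max(0, S_{t-1} + W_t), with S_0 = 0."""
    s, path = 0.0, []
    for p, y in zip(probs, outcomes):
        s = max(0.0, s + llr_weight(p, y, delta_a, gamma_a))
        path.append(s)
    return path


def dpcl_limits(probs, delta_a, gamma_a, alpha=0.01, q=2000, seed=0):
    """Simulation-based dynamic limits (illustrative sketch).

    At each time point, q in-control statistics are advanced with outcomes
    drawn as Bernoulli(p_t); the limit is the empirical (1 - alpha) quantile,
    and paths above it are replaced by resampling from the survivors.
    """
    rng = random.Random(seed)
    sims = [0.0] * q
    limits = []
    for p in probs:
        sims = [
            max(0.0, s + llr_weight(p, 1 if rng.random() < p else 0,
                                    delta_a, gamma_a))
            for s in sims
        ]
        ordered = sorted(sims)
        h = ordered[min(q - 1, math.ceil((1.0 - alpha) * q) - 1)]
        limits.append(h)
        survivors = [s for s in sims if s <= h]  # never empty: h is in sims
        sims = [rng.choice(survivors) for _ in range(q)]
    return limits


def first_signal(path, limits):
    """1-based index of the first time S_t exceeds its limit, else None."""
    for t, (s, h) in enumerate(zip(path, limits), start=1):
        if s > h:
            return t
    return None
```

With `delta_a = 1` and `gamma_a = 1/2` (the "scale down" alternative), a confident prediction that turns out wrong adds a positive increment to the statistic, while a confident prediction that turns out right subtracts a smaller amount, so the chart accumulates evidence of overconfidence. Predictions must lie strictly between 0 and 1 for the logarithms to be defined.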