Confidence-based Estimators for Predictive Performance in Model Monitoring

Juhani Kivimäki, Jakub Białek, Jukka K. Nurminen, Wojtek Kuberski

TL;DR

This work addresses unsupervised estimation of predictive accuracy for deployed ML systems when ground-truth labels are delayed or unavailable. It provides a theoretical justification for Average Confidence (AC) as an unbiased and consistent estimator under a calibration assumption and derives Poisson-binomial-based confidence intervals for AC, enabling rigorous uncertainty quantification. In experiments, AC often matches or outperforms more sophisticated confidence-based estimators under covariate shift, and the results highlight a strong link between calibration quality and estimation accuracy. The study also discusses limitations such as concept shift and data outside the training support, and outlines future directions for calibration-aware monitoring across broader metrics and real-world settings.

Abstract

After a machine learning model has been deployed into production, its predictive performance needs to be monitored. Ideally, such monitoring can be carried out by comparing the model's predictions against ground truth labels. For this to be possible, the ground truth labels must be available relatively soon after inference. However, there are many use cases where ground truth labels are available only after a significant delay, or in the worst case, not at all. In such cases, directly monitoring the model's predictive performance is impossible. Recently, novel methods for estimating the predictive performance of a model when ground truth is unavailable have been developed. Many of these methods leverage model confidence or other uncertainty estimates and are experimentally compared against a naive baseline method, namely Average Confidence (AC), which estimates model accuracy as the average of confidence scores for a given set of predictions. However, until now the theoretical properties of the AC method have not been properly explored. In this paper, we aim to fill this gap by reviewing the AC method and showing that, under certain general assumptions, it is an unbiased and consistent estimator of model accuracy with many desirable properties. We also compare this baseline estimator empirically against some more complex estimators and show that in many cases the AC method outperforms the others, although the comparative quality of the different estimators is heavily case-dependent.
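The AC estimate and an accompanying interval can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes the confidence scores are calibrated, so that each prediction is correct with probability equal to its score and the number of correct predictions in a batch follows a Poisson binomial distribution. The function names, the convolution-based PMF, and the equal-tailed quantile construction of the interval are assumptions made for this sketch.

```python
import numpy as np


def average_confidence(conf):
    """AC point estimate: the mean confidence score of the batch."""
    return float(np.mean(np.asarray(conf, dtype=float)))


def poisson_binomial_pmf(probs):
    """PMF of the number of successes among independent Bernoulli trials.

    Built by convolving the two-point distributions one trial at a time;
    entry k of the result is P(exactly k correct predictions).
    """
    pmf = np.array([1.0])
    for p in probs:
        pmf = np.convolve(pmf, [1.0 - p, p])
    return pmf


def accuracy_interval(conf, level=0.95):
    """Equal-tailed interval for batch accuracy (hypothetical construction).

    Takes the alpha/2 and 1 - alpha/2 quantiles of the Poisson binomial
    count and divides by the batch size.
    """
    conf = np.asarray(conf, dtype=float)
    n = len(conf)
    cdf = np.cumsum(poisson_binomial_pmf(conf))
    alpha = 1.0 - level
    # searchsorted returns the first index k with cdf[k] >= target.
    lo = int(np.searchsorted(cdf, alpha / 2))
    hi = min(int(np.searchsorted(cdf, 1.0 - alpha / 2)), n)
    return lo / n, hi / n


if __name__ == "__main__":
    # Synthetic confidence scores for one batch of 500 predictions.
    rng = np.random.default_rng(0)
    scores = rng.uniform(0.5, 1.0, size=500)
    print("AC estimate:", average_confidence(scores))
    print("95% interval:", accuracy_interval(scores))
```

The convolution approach costs O(n^2) per batch, which is fine for batches of a few hundred predictions; larger batches may call for an FFT-based or approximate PMF. If the scores are not calibrated on the monitored data, neither the point estimate nor the interval should be trusted.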

Paper Structure

This paper contains 24 sections, 3 theorems, 11 equations, 8 figures, and 8 tables.

Key Result

Lemma 1

Let $(X,Y)$ be an instance drawn from a target distribution $p_t(\boldsymbol{x}, y)$ and let $(\hat{Y}, S)$ be the corresponding prediction by a calibrated model $\boldsymbol{\mathrm{f}}$. Then,

Figures (8)

  • Figure 1: An example of estimating the predictive accuracy over 6 batches of 500 predictions using the AC method with CIs. The point estimate (blue line) given by the AC method closely follows the true accuracy (green line) in each batch, which in turn might deviate from the expected accuracy for the whole dataset (magenta line). In each case, the true (batch) accuracy also falls within the predicted 95 % confidence interval (red lines). The PMF of the Poisson binomial distribution for each batch is shown in light blue.
  • Figure 2: Distribution of confidence scores in the simulated data.
  • Figure 3: The quality of point estimates under gradual data drift.
  • Figure 4: The quality of the estimated CIs for the original and shifted data.
  • Figure 5: A visualization of the covariate shift with a linear decision boundary. The dashed line marks the decision boundary of the Bayes optimal classifier. The modes of the hard-to-predict mixture components are marked with '$\star$' and the modes of the easy-to-predict components are marked with '$\blacksquare$'.
  • ...and 3 more figures

Theorems & Definitions (7)

  • Definition 1
  • Lemma 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof