Reliably Detecting Model Failures in Deployment Without Labels

Viet Nguyen, Changjian Shui, Vijay Giri, Siddharth Arya, Amol Verma, Fahad Razak, Rahul G. Krishnan

TL;DR

This paper tackles the challenge of detecting post-deployment deterioration (PDD) without access to deployment labels by introducing D3M, a disagreement-driven monitoring framework. D3M operates in three stages: pre-training a feature extractor and a variational last layer to model a posterior predictive distribution, calibrating the maximum disagreement rate on unlabeled in-distribution data, and deploying with a threshold on the disagreement observed at deployment. It achieves low false-positive rates under non-deteriorating shifts and provable, sample-efficient detection under deteriorating shifts. Theoretical guarantees (under certain assumptions) bound the false-positive rate and ensure positive detection power, while empirical results on UCI, CIFAR, Camelyon17, and GEMINI demonstrate practical viability across diverse modalities and real-world clinical settings. Overall, D3M offers a principled, scalable, label-free guardrail for high-stakes production ML pipelines facing distribution drift, without leaking training data or requiring ongoing access to labels.

Abstract

The distribution of data changes over time; models operating in dynamic environments need retraining. But knowing when to retrain, without access to labels, is an open challenge since some, but not all, shifts degrade model performance. This paper formalizes and addresses the problem of post-deployment deterioration (PDD) monitoring. We propose D3M, a practical and efficient monitoring algorithm based on the disagreement of predictive models, achieving low false positive rates under non-deteriorating shifts and providing sample complexity bounds for high true positive rates under deteriorating shifts. Empirical results on both standard benchmarks and a real-world large-scale internal medicine dataset demonstrate the effectiveness of the framework and highlight its viability as an alert mechanism for high-stakes machine learning pipelines.

Paper Structure

This paper contains 54 sections, 14 theorems, 58 equations, 13 figures, 12 tables, 2 algorithms.

Key Result

Lemma A.1

Assume that the ground truth at training and deployment are identical, i.e. $g = g'$, and that $\operatorname{TV}(\bm{P}, \bm{Q})\leq \kappa$, we have that when $\mathop{\mathrm{err}}\nolimits(f, \bm{Q}_h) - \mathop{\mathrm{err}}\nolimits(f, \bm{P}_h) \geq 2(\kappa + \epsilon)$, i.e. the disagreemen

Figures (13)

  • Figure 1: Overview of D3M. (1) Train: a feature extractor ($\operatorname{FE}$) and a Variational Bayesian Last Layer ($\operatorname{VBLL}$) are trained to model a posterior predictive distribution (PPD) over class logits. (2) Calibrate: disagreement statistics are computed by bootstrapping held-out ID datasets, sampling from the learned posteriors, and comparing sampled predictions to the base model’s outputs to collect a set of maximum disagreement rates $\Phi$. For illustrative purposes, agreements and disagreements between $\widehat{y}^{(3)}$ and $\Bar{y}$ are colored green and orange, respectively. (3) Deploy: at deployment, D3M monitors the model on incoming unlabeled data by computing the maximum disagreement rate $\Tilde{\phi}$ and flags a deteriorating shift if $\Tilde{\phi} \geq \operatorname{Quantile}_{1-\alpha}(\Phi)$.
  • Figure 2: Performance on time-evolving shifted test data from GEMINI. (a) The performance drop (bar plot) is small, so the observed shift is non-deteriorating. (b) Monitoring under time-evolving shift. D3M is robust, with a small False Positive Rate (FPR) at level $\alpha = 0.05$.
  • Figure 3: Monitoring results on artificially shifted test data from the GEMINI dataset. (a) The performance drop (bar plot) is significant when the degree of shift is large ($0.0 \to 1.0$). (b) Results for different monitoring methods. D3M achieves competitive True Positive Rates (TPRs) at level $\alpha = 0.05$.
  • Figure 4: Illustration of the FNR/FPR tradeoff and its remedy. The background color indicates the fixed ground truth. Positive and Negative points are from $\bm{P}_{g}$ (labeled) and the unlabeled points are from $\bm{Q}_{x}$. The solid black curve represents the deployed base classifier $f$. The dotted Pink ($h_1$) and Blue ($h_2$) curves represent the envelope boundary for $\mathcal{H}_p$, i.e., all the functions passing between these two curves are contained in $\mathcal{H}_p$. (a) Failure scenario (i.e., Regime 2), where the D3M algorithm fails. (b) No deteriorating shift. (c) Deteriorating shift, where the D3M algorithm succeeds. In summary, a decrease in $\epsilon_f$ could move the failure scenario to one of the solvable scenarios (b) or (c).
  • Figure 5: (Left) Random samples from CIFAR-10.1. (Right) Random samples from CIFAR-10's test set. The images above are borrowed from Recht et al., "Do CIFAR-10 Classifiers Generalize to CIFAR-10?" (2018).
  • ...and 8 more figures
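
The calibrate and deploy stages described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the posterior-sampled predictions are already available as an integer label array of shape `(K, n)`, and it reuses one fixed set of samples across bootstrap rounds instead of drawing fresh samples from the VBLL each time. The function names and the parameters `n_boot` and `m` are hypothetical.

```python
import numpy as np


def max_disagreement_rate(sampled_preds, base_preds):
    """Largest fraction of points on which any sampled model disagrees with the base model.

    sampled_preds: (K, n) integer labels, one row per posterior sample (assumed precomputed).
    base_preds:    (n,) integer labels predicted by the deployed base model.
    """
    rates = (sampled_preds != base_preds[None, :]).mean(axis=1)
    return float(rates.max())


def calibrate(sampled_preds_id, base_preds_id, n_boot, m, rng):
    """Collect the set Phi of maximum disagreement rates over bootstrapped ID datasets.

    Simplification: the same K posterior samples are reused for every bootstrap round.
    """
    n = base_preds_id.shape[0]
    phi = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=m)  # bootstrap resample of size m, with replacement
        phi[b] = max_disagreement_rate(sampled_preds_id[:, idx], base_preds_id[idx])
    return phi


def flags_deterioration(phi_calib, phi_deploy, alpha=0.05):
    """Flag a deteriorating shift if phi_deploy >= Quantile_{1 - alpha}(Phi)."""
    return bool(phi_deploy >= np.quantile(phi_calib, 1.0 - alpha))
```

In use, `calibrate` would be run once on held-out in-distribution data, and `flags_deterioration` would then be called on the maximum disagreement rate computed from each incoming batch of unlabeled deployment data, with `alpha` setting the target false-positive level.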

Theorems & Definitions (28)

  • Definition 1: Post-deployment deterioration, PDD
  • Definition 2: Disagreement based PDD (D-PDD)
  • Lemma A.1: Equivalence condition
  • proof
  • Definition 3: Deployed classifier error
  • Definition 4: $\epsilon_p, \epsilon_q$ maximum error in $\mathcal{H}_p$
  • Definition 5: $\xi$ quantifies D-PDD
  • Definition 6: $\eta$ error gap between $\mathcal{H}_p$ and Bayes optimal
  • Proposition A.2: D-PDD and $\mathop{\mathrm{TV}}\nolimits$ distance
  • Theorem A.3
  • ...and 18 more