Sequential Harmful Shift Detection Without Labels

Salim I. Amoukou, Tom Bewley, Saumitra Mishra, Freddy Lecue, Daniele Magazzeni, Manuela Veloso

TL;DR

This work addresses the detection of harmful distribution shifts in continuous production without access to ground-truth labels. A plug-in error estimator $\hat{r}$ serves as a proxy for the true error $E$, and thresholds $(q,\hat{q})$ calibrated over empirical quantiles identify high-error observations. A sequential testing framework based on time-uniform confidence bounds constructs lower and upper bounds $\hat{L}_q$ and $\hat{U}_q$ (or $\hat{U}_q^2$) and raises an alarm when the estimated harmful-shift risk exceeds the baseline by more than a tolerance $\epsilon_{tol}$, with false-alarm control at level $\alpha_{source}+\alpha_{prod}$ under a mild assumption. Empirical results on CelebA, synthetic tabular shifts (California housing, Bike Sharing, HELOC, NHANES), and Folktables show that the proposed quantile-based detector achieves favorable power-FDP trade-offs and robust early detection, even when the error estimator is imperfect. The approach enables online monitoring of production systems that lack immediate labels, supporting timely intervention while keeping false alarms controlled.
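The sequential alarm rule summarized above can be illustrated with a minimal sketch. It is not the paper's construction: it swaps in a simple Hoeffding bound made time-uniform through a union bound over time steps, and the names `hoeffding_radius`, `source_upper_bound`, `first_alarm`, and the flag lists are hypothetical. Each flag is assumed to be 1 when the estimated error of an observation exceeds the calibrated threshold $\hat{q}$, and the alarm fires when the production lower bound exceeds the source upper bound by more than $\epsilon_{tol}$.

```python
import math


def hoeffding_radius(t, alpha):
    """Radius valid for all t at once: union bound with alpha_t = alpha / (t * (t + 1))."""
    alpha_t = alpha / (t * (t + 1))  # these levels sum to alpha over t = 1, 2, ...
    return math.sqrt(math.log(1.0 / alpha_t) / (2.0 * t))


def source_upper_bound(source_flags, alpha_source):
    """Fixed-sample Hoeffding upper bound on the source exceedance rate."""
    n = len(source_flags)
    radius = math.sqrt(math.log(1.0 / alpha_source) / (2.0 * n))
    return min(1.0, sum(source_flags) / n + radius)


def first_alarm(prod_flags, upper_source, eps_tol, alpha_prod):
    """Return the first time t at which the alarm fires, or None if it never does."""
    exceed_count = 0
    for t, flag in enumerate(prod_flags, start=1):
        exceed_count += flag
        # Time-uniform lower bound on the production exceedance rate.
        lower_prod = exceed_count / t - hoeffding_radius(t, alpha_prod)
        if lower_prod > upper_source + eps_tol:
            return t
    return None


# Hypothetical flags: 1 means the estimated error exceeded the threshold.
source_flags = [1] * 10 + [0] * 90      # 10% exceedance on source data
upper = source_upper_bound(source_flags, alpha_source=0.05)

shifted = [1] * 50                      # every production point exceeds
unshifted = ([0] * 9 + [1]) * 50        # same 10% rate as the source

print(first_alarm(shifted, upper, eps_tol=0.0, alpha_prod=0.05))
print(first_alarm(unshifted, upper, eps_tol=0.0, alpha_prod=0.05))
```

With this choice of bounds, the two error budgets add up by a union bound, which mirrors the $\alpha_{source}+\alpha_{prod}$ level quoted above. Tighter time-uniform bounds would detect shifts earlier than this Hoeffding sketch.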

Abstract

We introduce a novel approach for detecting distribution shifts that negatively impact the performance of machine learning models in continuous production environments, which requires no access to ground truth data labels. It builds upon the work of Podkopaev and Ramdas [2022], who address scenarios where labels are available for tracking model errors over time. Our solution extends this framework to work in the absence of labels, by employing a proxy for the true error. This proxy is derived using the predictions of a trained error estimator. Experiments show that our method has high power and false alarm control under various distribution shifts, including covariate and label shifts and natural shifts over geography and time.

Paper Structure

This paper contains 20 sections, 3 theorems, 25 equations, 12 figures, 3 tables.

Key Result

Theorem 4.2

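Under the assumption labeled assumption:fdr_prod in the paper, $\hat{L}_q$ and $\hat{U}_q$ satisfy the bounds labeled eq:lower_bound and eq:upper_bound_quantile. Therefore, the function $\Phi_q$ has false alarm control, i.e., its false alarm probability is at most $\alpha_{source}+\alpha_{prod}$.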

Figures (12)

  • Figure 1: Overview of the proposed approach. Left: calibrating an estimated error threshold to separate low/high true errors. Right: sequentially tracking production data exceeding the threshold and raising an alarm upon a significant increase.
  • Figure 2: Calibration toy example. Left: threshold grid created by sweeping $p\in[0.5,0.95]$ at increments of $0.05$ and $\hat{p}\in[0.1,0.9]$ at increments of $0.1$. Middle: FDP of selector for each $(p, \hat{p})$ pair. Black outline indicates pairs for which FDP $< 0.2$. Right: selector power for each $(p, \hat{p})$ pair. Green dotted outline indicates the pair that maximises power subject to the FDP $< 0.2$ limit. Corresponding thresholds $(q,\hat{q})$ shown as thick lines in left plot.
  • Figure 3: Selector FDP (left) and power (right) vs estimator accuracy. Results on source data in blue; results on production data in red.
  • Figure 4: Evolution of bounds in production for mean detector (left) and quantile detector (right).
  • Figure 5: Left: Power/FDP when $\epsilon_{tol}=0$ for all datasets. Middle: Absolute detection time difference vs. the methods using true errors. Right: Power values for different harmfulness thresholds ($\epsilon_{tol}$).
  • ...and 7 more figures
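
The calibration step described in the Figure 2 caption can be sketched as a grid search. This is an illustrative reading of the caption, not the authors' code: the function names, the order-statistic quantile, and the definitions of FDP (share of selected points whose true error is not above $q$) and power (share of high-error points that are selected) are assumptions. The grids and the FDP limit of 0.2 follow the caption.

```python
def empirical_quantile(values, level):
    """Simple order-statistic quantile (an assumed convention, no interpolation)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(level * len(ordered) + 1e-9))
    return ordered[index]


def fdp_and_power(true_err, est_err, q, q_hat):
    """FDP and power of the selector that picks points with estimated error > q_hat."""
    selected = [e_hat > q_hat for e_hat in est_err]
    high = [e > q for e in true_err]
    n_selected = sum(selected)
    n_high = sum(high)
    true_pos = sum(s and h for s, h in zip(selected, high))
    fdp = (n_selected - true_pos) / n_selected if n_selected else 0.0
    power = true_pos / n_high if n_high else 0.0
    return fdp, power


def calibrate(true_err, est_err, fdp_limit=0.2):
    """Pick the (p, p_hat) pair maximising power subject to FDP < fdp_limit."""
    p_grid = [k / 100 for k in range(50, 100, 5)]   # 0.50, 0.55, ..., 0.95
    p_hat_grid = [k / 10 for k in range(1, 10)]     # 0.1, 0.2, ..., 0.9
    best = None
    for p in p_grid:
        q = empirical_quantile(true_err, p)
        for p_hat in p_hat_grid:
            q_hat = empirical_quantile(est_err, p_hat)
            fdp, power = fdp_and_power(true_err, est_err, q, q_hat)
            if fdp < fdp_limit and (best is None or power > best["power"]):
                best = {"p": p, "p_hat": p_hat, "q": q, "q_hat": q_hat,
                        "fdp": fdp, "power": power}
    return best


# Hypothetical labelled calibration data with a perfect error estimator.
true_errors = [i / 100 for i in range(100)]
estimated_errors = list(true_errors)
print(calibrate(true_errors, estimated_errors))
```

Only the estimated-error threshold $\hat{q}$ is usable in production, where true errors are unobserved; the labelled calibration set is what ties $\hat{q}$ to the true-error threshold $q$.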

Theorems & Definitions (4)

  • Theorem 4.2
  • Theorem B.1
  • Theorem C.1
  • proof