Suitability Filter: A Statistical Framework for Classifier Evaluation in Real-World Deployment Settings

Angéline Pouget; Mohammad Yaghini; Stephan Rabanser; Nicolas Papernot

TL;DR

The paper proposes the suitability filter, a modular statistical framework for detecting degradation in classifier performance on unlabeled deployment data by comparing the estimated accuracy on $D_u$ to the labeled accuracy on $D_{\text{test}}$ within a margin $m$ using non-inferiority testing. It combines a per-sample correctness estimator $C$, generated from domain-agnostic suitability signals, with calibration considerations and a Welch's $t$-test to yield a dataset-level SUITABLE/INCONCLUSIVE decision. The approach is grounded in theoretical guarantees for false-positive control under $\delta$-calibration, and is empirically validated on multiple WILDS benchmarks (FMoW-WILDS, RxRx1-WILDS, CivilComments-WILDS) across various covariate shifts, demonstrating reliable detection of deterioration and practical deployment insights. The work aims to enable auditable SLAs and safer, more trustworthy deployment by offering a label-free mechanism to monitor model performance in dynamic environments.

Abstract

Deploying machine learning models in safety-critical domains poses a key challenge: ensuring reliable model performance on downstream user data without access to ground truth labels for direct validation. We propose the suitability filter, a novel framework designed to detect performance deterioration by utilizing suitability signals -- model output features that are sensitive to covariate shifts and indicative of potential prediction errors. The suitability filter evaluates whether classifier accuracy on unlabeled user data shows significant degradation compared to the accuracy measured on the labeled test dataset. Specifically, it ensures that this degradation does not exceed a pre-specified margin, which represents the maximum acceptable drop in accuracy. To achieve reliable performance evaluation, we aggregate suitability signals for both test and user data and compare these empirical distributions using statistical hypothesis testing, thus providing insights into decision uncertainty. Our modular method adapts to various models and domains. Empirical evaluations across different classification tasks demonstrate that the suitability filter reliably detects performance deviations due to covariate shift. This enables proactive mitigation of potential failures in high-stakes applications.
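
The dataset-level decision described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the per-sample correctness estimates (the outputs of $C$) are already available as lists of probabilities for the test and user data, and it tests the non-inferiority hypothesis $H_0: \mu_u \le \mu_{\text{test}} - m$ against $H_1: \mu_u > \mu_{\text{test}} - m$ with a one-sided Welch's $t$-test. The function names `t_sf` and `suitability_decision`, the significance level `alpha`, and the numerical integration used for the $t$-distribution tail are all illustrative assumptions.

```python
import math
from statistics import mean, variance


def t_sf(t, df, steps=2000):
    """Upper-tail probability P(T > t) of Student's t with `df` degrees of
    freedom, computed by Simpson's rule (a simple stand-in for a stats library)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    a = abs(t)
    h = a / steps  # `steps` must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # integral of the density over [0, |t|]
    return 0.5 - area if t >= 0 else 0.5 + area


def suitability_decision(p_test, p_user, margin, alpha=0.05):
    """One-sided Welch non-inferiority test on per-sample correctness estimates.

    H0: mean(p_user) <= mean(p_test) - margin  (accuracy dropped by at least the margin)
    H1: mean(p_user) >  mean(p_test) - margin  (drop is within the margin)
    Assumes both samples have at least two elements and nonzero variance.
    """
    n_t, n_u = len(p_test), len(p_user)
    v_t = variance(p_test) / n_t  # squared standard error, test data
    v_u = variance(p_user) / n_u  # squared standard error, user data
    t_stat = (mean(p_user) - mean(p_test) + margin) / math.sqrt(v_t + v_u)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (v_t + v_u) ** 2 / (v_t ** 2 / (n_t - 1) + v_u ** 2 / (n_u - 1))
    p_value = t_sf(t_stat, df)
    decision = "SUITABLE" if p_value < alpha else "INCONCLUSIVE"
    return decision, p_value


# Hypothetical correctness estimates, for illustration only
test_scores = [0.80, 0.90, 0.85, 0.75, 0.95] * 20
user_scores = [0.50, 0.60, 0.55, 0.45, 0.65] * 20
print(suitability_decision(test_scores, user_scores, margin=0.05))
```

Rejecting $H_0$ yields SUITABLE; failing to reject yields INCONCLUSIVE rather than "unsuitable", because the test bounds only the probability of wrongly declaring suitability. In this sketch, that false-positive control also depends on how well the correctness estimates are calibrated.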
