Zero-failure testing of binary classifiers
Ioannis Ivrissimtzis, Matthew Houliston, Shauna Concannon, Graham Roberts
TL;DR
The paper tackles asymmetric error costs in binary classification by introducing zero-failure testing, where the operating point is chosen to guarantee zero misclassifications on the positive (under-threshold) set, and performance is measured by the true negative rate (TNR) on the negatives. It presents a formal framework for zero-failure tests, including a binomial/Bayesian interpretation and the ability to build nested test sets of increasing difficulty. The authors illustrate the method on age-threshold problems with synthetic data, Morph2-based CORAL-CNN and OR-CNN comparisons, and human-estimation data from appa-real, highlighting design considerations and the impact of outliers. They argue for curated acceptance tests and requirement specifications that separate the quality of testing from the training data, and they discuss future work on bias and broader deployments, including implications for regulatory certification.
Abstract
We propose using performance metrics derived from zero-failure testing to assess binary classifiers. The principal characteristic of the proposed approach is the asymmetric treatment of the two types of error. In particular, we construct a test set consisting of positive and negative samples, set the operating point of the binary classifier at the lowest value that results in the correct classification of all positive samples, and use the algorithm's success rate on the negative samples as a performance measure. A property of the proposed approach, setting it apart from other commonly used testing methods, is that it allows the construction of a series of tests of increasing difficulty, corresponding to a nested sequence of positive sample test sets. We illustrate the proposed method on the problem of age estimation for determining whether a subject is above a legal age threshold, a problem that exemplifies the asymmetry of the two types of error. Indeed, misclassifying an under-aged subject is a legal and regulatory issue, while misclassifying a person above the legal age is an efficiency issue primarily concerning the commercial user of the age estimation system.
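The procedure described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the classifier outputs a scalar score (here, an estimated age), that a sample is flagged as positive (under-aged) when its score is at or below the operating point, and that the function name `zero_failure_tnr` and all sample values are hypothetical.

```python
def zero_failure_tnr(pos_scores, neg_scores):
    """Return (operating_point, tnr) for a zero-failure test.

    Assumed convention (not taken from the paper): a sample is flagged
    as positive when score <= operating_point, so the lowest operating
    point that catches every positive is the largest positive score.
    """
    operating_point = max(pos_scores)
    # A negative sample is handled correctly only if it is NOT flagged.
    correct_negatives = sum(1 for s in neg_scores if s > operating_point)
    return operating_point, correct_negatives / len(neg_scores)


# Hypothetical estimated ages, for illustration only.
pos_ages = [12.0, 15.5, 17.0, 19.2]        # under-aged subjects (positives)
neg_ages = [18.5, 21.0, 24.0, 30.0, 19.0]  # subjects above the legal age (negatives)

threshold, tnr = zero_failure_tnr(pos_ages, neg_ages)

# Nested positive sets: each prefix is contained in the next, so the
# operating point can only rise and the TNR can only fall or stay equal.
nested_tnrs = [zero_failure_tnr(pos_ages[:k], neg_ages)[1] for k in (2, 3, 4)]
```

Under this convention, adding a harder positive sample can never lower the operating point, which is why a nested sequence of positive test sets yields tests of non-decreasing difficulty.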
