Zero-failure testing of binary classifiers

Ioannis Ivrissimtzis, Matthew Houliston, Shauna Concannon, Graham Roberts

TL;DR

The paper tackles asymmetric error costs in binary classification by introducing zero-failure testing, where the operating point is chosen to guarantee zero misclassifications on the positive (under-threshold) set, and performance is measured by the true negative rate (TNR) on the negatives. It presents a formal framework for zero-failure tests, including a binomial/Bayesian interpretation and the ability to build nested test sets of increasing difficulty. The authors illustrate the method on age-threshold problems with synthetic data, Morph2-based CORAL-CNN and OR-CNN comparisons, and human-estimation data from appa-real, highlighting design considerations and the impact of outliers. They argue for curated acceptance tests and requirement specifications that separate testing quality from training data, and they discuss future work on bias and broader deployments, including implications for regulatory certification.

Abstract

We propose using performance metrics derived from zero-failure testing to assess binary classifiers. The principal characteristic of the proposed approach is the asymmetric treatment of the two types of error. In particular, we construct a test set consisting of positive and negative samples, set the operating point of the binary classifier at the lowest value that results in the correct classification of all positive samples, and use the algorithm's success rate on the negative samples as a performance measure. A property of the proposed approach, setting it apart from other commonly used testing methods, is that it allows the construction of a series of tests of increasing difficulty, corresponding to a nested sequence of positive sample test sets. We illustrate the proposed method on the problem of age estimation for determining whether a subject is above a legal age threshold, a problem that exemplifies the asymmetry of the two types of error. Indeed, misclassifying an under-aged subject is a legal and regulatory issue, while misclassifying a person above the legal age is an efficiency issue primarily concerning the commercial user of the age estimation system.
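
The procedure described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes that the classifier outputs a scalar age estimate, that the positive samples are the under-age subjects, and that a subject is accepted as over-age only when the estimate strictly exceeds the operating point. The function name `zero_failure_tnr` and the toy numbers are hypothetical.

```python
def zero_failure_tnr(pos_estimates, neg_estimates):
    """Return (operating_point, tnr) for a zero-failure test.

    pos_estimates: estimated ages of the under-age (positive) subjects.
    neg_estimates: estimated ages of the over-age (negative) subjects.
    """
    if not pos_estimates or not neg_estimates:
        raise ValueError("both sample sets must be non-empty")
    # Lowest threshold at which no positive sample is accepted as over-age.
    operating_point = max(pos_estimates)
    # A negative is classified correctly only if its estimate exceeds the threshold.
    correct = sum(1 for e in neg_estimates if e > operating_point)
    return operating_point, correct / len(neg_estimates)


# Hypothetical toy data (estimated ages), for illustration only.
under_age = [14.2, 16.8, 19.5, 15.1, 21.0]
adults = [20.3, 24.9, 31.2, 18.7, 27.5, 22.4]

# Nested positive sets: the first three samples are a subset of all five.
small_op, small_tnr = zero_failure_tnr(under_age[:3], adults)
full_op, full_tnr = zero_failure_tnr(under_age, adults)

# Enlarging the positive set can only raise the operating point,
# so the TNR can only stay the same or drop.
assert small_op <= full_op and small_tnr >= full_tnr
print(small_op, small_tnr)
print(full_op, full_tnr)
```

Under these assumptions the operating point is simply the highest estimate among the positive samples, which is why a single outlier in the positive set can dominate the result, and why a larger nested positive set yields a test that is at least as hard.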

Paper Structure

This paper contains 12 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Top: Histograms of the actual ages of the subjects. The age range 12-17 is shown in blue, the range 18-50 in orange. Bottom: Histograms of the estimated ages of the subjects. The estimates for the subjects in the age range 12-17 are shown in blue, the estimates for the subjects in the range 18-50 in orange. Left to right: Confidence-reliability pairs of (.95, .95), (.95, .995), and (.95, .998).
  • Figure 2: Zero-failure testing as the intersection of the family of tests that allow a fixed number of failures (top) and the family of tests that allow a fixed ratio of failures (bottom).
  • Figure 3: The nested test sets $\hbox{zFail-60} \subset \hbox{zFail-200} \subset \hbox{zFail-600} \subset \hbox{zFail-1550}$. For the CORAL-CNN seed-1 classifier, the highest age estimate on zFail-1550, depicted with a red dot, comes from a sample in its subset zFail-600.
  • Figure 4: Shown in red, the histogram of the actual ages of the 215 subjects in the validation set of the appa-real database in the 6-17 age range. Shown in blue, the histogram of the average age estimates by human estimators.
  • Figure 5: The nine subjects in the validation set of the appa-real database with actual age between 6 and 17 and average human estimated age higher than 25.
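
The confidence-reliability pairs in the Figure 1 caption can be related to the size of a zero-failure test through the textbook success-run relation $C = 1 - R^n$: if $n$ independent positive samples all pass, one may claim with confidence $C$ that the per-sample pass probability is at least $R$. The sketch below applies that relation. It is an assumption for illustration; the paper's own binomial/Bayesian formulation is not reproduced in this section and may differ. The function names are hypothetical.

```python
import math


def zero_failure_sample_size(confidence, reliability):
    """Smallest n with 1 - reliability**n >= confidence (success-run relation)."""
    if not (0.0 < confidence < 1.0 and 0.0 < reliability < 1.0):
        raise ValueError("confidence and reliability must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


def zero_failure_confidence(n, reliability):
    """Confidence that the pass probability is at least `reliability` after n passes."""
    return 1.0 - reliability ** n


# The three confidence-reliability pairs listed in the Figure 1 caption.
for c, r in [(0.95, 0.95), (0.95, 0.995), (0.95, 0.998)]:
    n = zero_failure_sample_size(c, r)
    print(c, r, n, round(zero_failure_confidence(n, r), 4))
```

The required number of zero-failure samples grows quickly as the reliability target approaches 1, which gives a rough sense of why stricter requirements call for larger positive test sets.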