Analysis of Diagnostics (Part I): Prevalence, Uncertainty Quantification, and Machine Learning

Paul N. Patrone; Raquel A. Binder; Catherine S. Forconi; Ann M. Moormann; Anthony J. Kearsley

TL;DR

This manuscript proposes a numerical homotopy algorithm that estimates the relative probability level-sets $B^\star (q)$ by minimizing a prevalence-weighted empirical error, and deduces that the corresponding classifiers obey a useful monotonicity property that stabilizes the numerics and points to important extensions to UQ of ML.

Abstract

Diagnostic testing provides a unique setting for studying and developing tools in classification theory. In such contexts, the concept of prevalence, i.e., the number of individuals with a given condition, is fundamental, both as an inherent quantity of interest and as a parameter that controls classification accuracy. This manuscript is the first in a two-part series that studies deeper connections between classification theory and prevalence, showing how the latter establishes a more complete theory of uncertainty quantification (UQ) for certain types of machine learning (ML). We motivate this analysis via a lemma demonstrating that general classifiers minimizing a prevalence-weighted error contain the same probabilistic information as Bayes-optimal classifiers, which depend on conditional probability densities. This leads us to study relative probability level-sets $B^\star (q)$, which are reinterpreted as both classification boundaries and useful tools for quantifying uncertainty in class labels. To realize this in practice, we also propose a numerical homotopy algorithm that estimates the $B^\star (q)$ by minimizing a prevalence-weighted empirical error. The successes and shortcomings of this method motivate us to revisit properties of the level sets, and we deduce that the corresponding classifiers obey a useful monotonicity property that stabilizes the numerics and points to important extensions to UQ of ML. Throughout, we validate our methods in the context of synthetic data and a research-use-only SARS-CoV-2 enzyme-linked immunosorbent (ELISA) assay.
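
For orientation, the prevalence-weighted error and the standard Bayes-optimal rule that minimizes it can be written as below. This is a sketch under assumptions rather than the paper's own statement: $P$ and $N$ denote the conditional PDFs of the positive and negative populations, and the symbols $\mathcal{E}$, $D_P$, $D_N$, and the explicit form given for $B^\star(q)$ are notation assumed here for illustration.

```latex
\begin{align}
\mathcal{E}(D_P, D_N; q) &= q \int_{D_N} P(\boldsymbol{\rm r})\, d\boldsymbol{\rm r}
                          + (1-q) \int_{D_P} N(\boldsymbol{\rm r})\, d\boldsymbol{\rm r}, \\
D_P^\star(q) &= \{ \boldsymbol{\rm r} : q P(\boldsymbol{\rm r}) > (1-q) N(\boldsymbol{\rm r}) \}, \\
B^\star(q)   &= \{ \boldsymbol{\rm r} : q P(\boldsymbol{\rm r}) = (1-q) N(\boldsymbol{\rm r}) \}, \\
q_1 < q_2 &\implies D_P^\star(q_1) \subseteq D_P^\star(q_2).
\end{align}
```

The last line follows because $q_2 P \ge q_1 P > (1-q_1) N \ge (1-q_2) N$ whenever $\boldsymbol{\rm r} \in D_P^\star(q_1)$, so positive classification domains can only grow with $q$ and their boundaries should not cross.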

Paper Structure (19 sections, 9 theorems, 47 equations, 6 figures)


Appeal to the Reader
Introduction and Motivation
Mathematical Setting
Notation
Key Assumptions
Background Theory
A Motivating Lemma and Two Postulates
Classification Without PDFs
General Formulation of the Classification Method
Quadratic Approximation
Validation
Synthetic Data
A SARS-CoV-2 ELISA Assay
Uncertainty Level-Sets
Theoretical Issues
...and 4 more sections

Key Result

Lemma 2.7

Let $C(\omega)$ be a binary random variable and assume $P(\boldsymbol {\rm r})$ and $N(\boldsymbol {\rm r})$ are known. Let $D \subset \Gamma$ be a subdomain chosen such that the difference of measures $P_D$ and $N_D$ is non-zero; i.e. $|P_D-N_D|>0$. Let there be $s$ iid random variables $\boldsymbol {\rm r}(\omega_i)$ [...] is an unbiased estimate of $q$ that converges in mean-square as $s\to \infty$, where $\mathbb I$ is [...]
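
The estimator's formula is not reproduced in the statement above, so the following is only a minimal sketch of one form consistent with the lemma's hypotheses. It assumes the test population is the mixture $Q = qP + (1-q)N$, so that $Q_D = qP_D + (1-q)N_D$ and hence $q = (Q_D - N_D)/(P_D - N_D)$, with $Q_D$ replaced by the fraction of samples falling in $D$. The 1D Gaussian populations, the threshold domain, and all function names are hypothetical choices for illustration.

```python
import math
import random


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def estimate_prevalence(samples, in_domain, p_d, n_d):
    """Class-agnostic estimate of q from unlabeled samples.

    Assumes samples are drawn from Q = q*P + (1-q)*N, and that p_d and n_d
    are the (known) measures of the domain D under P and N.
    """
    if p_d == n_d:
        raise ValueError("P_D and N_D must differ for the estimate to exist")
    # Empirical measure of D: average of the indicator function over samples.
    q_d_hat = sum(1 for r in samples if in_domain(r)) / len(samples)
    return (q_d_hat - n_d) / (p_d - n_d)


def simulate(q, s, rng, mu_pos=3.0, mu_neg=0.0):
    """Draw s unlabeled samples; each is positive with probability q."""
    return [rng.gauss(mu_pos if rng.random() < q else mu_neg, 1.0)
            for _ in range(s)]


# Hypothetical setup: D = {r > 1.5}, negatives ~ N(0,1), positives ~ N(3,1).
THRESHOLD = 1.5
P_D = 1.0 - normal_cdf(THRESHOLD, mu=3.0)
N_D = 1.0 - normal_cdf(THRESHOLD, mu=0.0)

rng = random.Random(0)
samples = simulate(q=0.15, s=20000, rng=rng)
q_hat = estimate_prevalence(samples, lambda r: r > THRESHOLD, P_D, N_D)
print(f"estimated prevalence: {q_hat:.4f}")
```

Because the indicator average is an unbiased estimate of $Q_D$ and the map to $q$ is affine, this estimate inherits unbiasedness, and its variance scales like $1/s$. The estimate is not constrained to $[0,1]$, so clipping may be needed in practice.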

Figures (6)

Figure 1: Representative output of an antibody assay. Each point corresponds to a random variable $\boldsymbol {\rm r}(\omega)$ whose underlying sample point $\omega$ is an individual that donated blood. The $\boldsymbol {\rm r}(\omega)$ are measurement outcomes of a diagnostic test that quantifies the amount of two types of antibodies that bind to different parts of the SARS-CoV-2 virus. The horizontal axis is the scale for the dimensionless mean fluorescence intensity (MFI) measurement that quantifies the amount of receptor-binding-domain (RBD) immunoglobulin G (IgG) antibodies in each sample. The vertical axis sets the corresponding scale for nucleocapsid (N) IgG antibodies. In a typical diagnostic setting, the underlying true class $C(\omega)$ of the individual is unknown. In the figure, however, extra information allows us to classify the data as either negative (blue o) or positive (red x). Thus, this dataset is an example of pure training data; cf. Definition \ref{def:pure}. See Ref. Raquel1 for additional details of this data.
Figure 1: Left: Example of 2000 dimensionless, synthetic data points and analysis thereof. Positive and negative data-points were generated according to the distributions given by Eqs. \ref{eq:PandN}. The estimated classification boundary is given by solving Eq. \ref{eq:homotopy}. The inset shows the average Frobenius norm $|| {\boldsymbol{\rm A}}^\star - {\boldsymbol{\rm A}}_{s+1}^\star||^2$ as a function of the number of sample points $\mathcal{S}$. Note that the norm displays an approximate $1/S$ behavior consistent with convergence in mean-square. Right: Illustration of the Homotopy Classifier; see Construction \ref{constr:homotopy}. Data was generated from the same distributions as in the left plot. However, the true prevalence was $q=0.15$. While the true classes of the datapoints are indicated in the figure, all data was combined and treated as test data for the purposes of this example. All optimizations were done using a training population constructed in the same manner as the left plot. Moreover, the same initial conditions and homotopy parameters were used to determine $\hat{q}$ (estimated to be $0.1525$ for the data shown) for the test population, after which the classification boundary was optimized using the empirical prevalence estimate. The inset shows a histogram of $\hat{q}$ values associated with repeating this numerical experiment 100 times.
Figure 1: Left: Example of optimal classification boundaries that cross. Each boundary was computed by solving the optimization problem associated with Eq. \ref{eq:objectivesum}, assuming a boundary described by the quadratic model. Note that the classification boundaries cross. Thus, the point labeled "W" (black dot) has an ambiguously defined classification accuracy. Specifically, it must simultaneously satisfy the inequalities $\alpha(\boldsymbol {\rm r}_{\rm W})>0.1$, $\alpha(\boldsymbol {\rm r}_{\rm W})<0.3$, and $\alpha(\boldsymbol {\rm r}_{\rm W})>0.9$, which is not possible. Right: Example of classification boundaries computed according to Def. \ref{def:lsof}. We computed classification boundaries for $q=0.05,0.1,\ldots,0.95$ and used 100 uniformly spaced shadow points, which are indicated by faint black diamonds. We also included a shadow point at the coordinate $(-1.91,-2.25)$ to eliminate overlapping in that region. Note that $q$ increases going from the upper-right to lower-left.
Figure 2: Left: Classification of 2D SARS-CoV-2 training data for $q=268/460$, the true prevalence of the dataset. Negative samples were collected before the pandemic. Positive samples were collected before the release of SARS-CoV-2 vaccines; thus, RBD is still a meaningful determinant of a positive sample. The initial classification boundary is a straight line determined according to Eq. \ref{eq:ao}. The boundaries associated with the first two iterations of Eq. \ref{eq:homotopy} are shown in dotted and dash-dot lines. By the third iteration, the homotopy method has converged to a solution. The corresponding value of $\sigma^2$ is $|\nu|/1000$. Right: Example of optimization without using the homotopy method. The initial guess of the optimization was the same as in the left plot. We directly set $\sigma^2=10^{-8}$ and performed a single optimization via Eq. \ref{eq:homotopy}. The resulting positive classification domain poorly separates positive and negative populations.
Figure 3: Illustration of how the homotopy method stabilizes optimization of the classification boundary. We seek to find a conic section that best separates the positive and negative populations. However, attempting to work directly with the empirical data is like threading a needle; there is a significant chance that too many data points will end up on the wrong side of the boundary. See Fig. \ref{fig:sarsdata}, for example. Thus, we blur out the data to temporarily diminish the significance of individual points. Intuitively, this regularizes each iteration of the optimization by finding the surface that best separates the blue and yellow shaded areas. The degree of blurring is proportional to $\sigma^2$. Going left to right and top to bottom (A to D), the four values of $\sigma^2$ are $\sigma^2_1=1$, $\sigma^2_2=0.25$, $\sigma^2_3=0.1$, and $\sigma^2_4=0.05$. The final boundary computed for $\sigma^2_j$ is used as the initial guess in the optimization associated with $\sigma^2_{j+1}$. The color scale is created by convolving the empirical distributions for positive and negative samples with a Gaussian probability density function having a standard deviation $\sigma$. This is done for illustrative purposes and does not reflect the specifics of how data is blurred in Eq. \ref{eq:homotopy}. Note that as $\sigma\to 0$, the objective function becomes the empirical classification error, which is the quantity we wish to minimize.
...and 1 more figures
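
The homotopy idea described in the Figure 3 caption (blur the data, optimize, shrink the blur, and warm-start the next optimization) can be sketched in one dimension. This is a simplified illustration, not the paper's method: it assumes a single threshold boundary instead of a conic section, uses a Gaussian-CDF smoothing of the indicator function (the caption notes that the paper's actual blurring differs), and replaces the optimizer with a local grid search. Only the $\sigma^2$ schedule is taken from the caption; all names and data are hypothetical.

```python
import math
import random


def smooth_step(x, sigma):
    """Gaussian-CDF blur of the indicator 1[x > 0]; sharpens as sigma -> 0."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def smoothed_error(t, positives, negatives, q, sigma):
    """Prevalence-weighted, blurred empirical error of the rule 'positive if r > t'."""
    # Positives count as errors when they fall below the threshold.
    false_neg = sum(smooth_step(t - r, sigma) for r in positives) / len(positives)
    # Negatives count as errors when they fall above the threshold.
    false_pos = sum(smooth_step(r - t, sigma) for r in negatives) / len(negatives)
    return q * false_neg + (1.0 - q) * false_pos


def homotopy_threshold(positives, negatives, q, t0, sigmas,
                       half_width=1.0, n_grid=101):
    """Minimize the blurred error for each sigma, warm-starting from the last result."""
    t = t0
    for sigma in sigmas:
        # Local search window centered on the previous solution.
        grid = [t - half_width + 2.0 * half_width * k / (n_grid - 1)
                for k in range(n_grid)]
        t = min(grid, key=lambda c: smoothed_error(c, positives, negatives, q, sigma))
    return t


# Hypothetical training data: negatives ~ N(0,1), positives ~ N(3,1).
rng = random.Random(1)
positives = [rng.gauss(3.0, 1.0) for _ in range(1000)]
negatives = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# sigma^2 schedule from the Figure 3 caption: 1, 0.25, 0.1, 0.05.
sigmas = [math.sqrt(v) for v in (1.0, 0.25, 0.1, 0.05)]
t_star = homotopy_threshold(positives, negatives, q=0.5, t0=0.5, sigmas=sigmas)
print(f"estimated threshold: {t_star:.3f}")
```

For these symmetric populations with $q=0.5$, the Bayes-optimal threshold is $1.5$, so the output should land near that value up to sampling noise. The warm start matters because each search only explores a window around the previous solution, which mirrors how the blurred problem guides the sharper one.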

Theorems & Definitions (54)

Example 2.1: Antibody Assays
Remark 2.2
Definition 2.3: Prevalence Convention
Definition 2.4: PDF Conventions
Definition 2.5: PDF of a Binary Test Population
Remark 2.6
Lemma 2.7: Class-Agnostic Prevalence Estimator
Remark 2.8
Definition 2.9
Remark 2.10: Equivalence Class of Partitions
...and 44 more
