Structure of Classifier Boundaries: Case Study for a Naive Bayes Classifier

Alan F. Karr; Zac Bowen; Adam A. Porter; Regina Ruane

TL;DR

This work investigates the boundary structure of a naive Bayes classifier that assigns DNA reads, which form a discrete, graph-structured input space, to one of three genomes. It introduces Neighbor Similarity (NS) as a practical, implementable surrogate for uncertainty, and shows that the boundary can be large (up to ~30% of inputs) and geometrically complex. The authors connect NS to two intrinsic uncertainty measures (maximum posterior probability and entropy), demonstrate the relationships via quadratic and partition models, and provide strategies to locate and explore the boundary, including sampling-based approximations. The findings generalize across datasets and classifiers, offering a principled framework for uncertainty assessment in discrete, graph-based classification, with implications for metagenomics and data quality.

Abstract

Classifiers assign complex input data points to one of a small number of output categories. For a Bayes classifier whose input space is a graph, we study the structure of the boundary, which comprises those points for which at least one neighbor is classified differently. The scientific setting is assignment of DNA reads produced by next-generation sequencers to candidate source genomes. The boundary is both large and complicated in structure. We introduce a new measure of uncertainty, Neighbor Similarity, that compares the result for an input point to the distribution of results for its neighbors. This measure not only tracks two inherent uncertainty measures for the Bayes classifier, but can also be implemented for classifiers without inherent measures of uncertainty.
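
The definitions in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: neighbors are taken to be reads that differ by one base substitution, `toy_classifier` is a hypothetical stand-in for the Bayes classifier, and Neighbor Similarity is taken to be the fraction of neighbors that receive the same decision as the read itself (the abstract states only that it compares a read's result to the distribution of its neighbors' results). The last function estimates that fraction from a random sample of neighbors.

```python
import random
from collections import Counter

ALPHABET = "ACGT"


def neighbors(read):
    """Yield every read at Hamming distance 1 (assumed neighbor relation)."""
    for i, base in enumerate(read):
        for alt in ALPHABET:
            if alt != base:
                yield read[:i] + alt + read[i + 1:]


def toy_classifier(read):
    """Hypothetical classifier: the most frequent base among A, C, G.

    Ties go to the earlier letter in "ACG" because max() returns the
    first maximal element it encounters.
    """
    counts = Counter(read)
    return max("ACG", key=lambda b: counts[b])


def neighbor_distribution(read, classify):
    """Distribution of classifier decisions over the read's neighbors."""
    labels = Counter(classify(n) for n in neighbors(read))
    total = sum(labels.values())
    return {label: k / total for label, k in labels.items()}


def is_boundary(read, classify):
    """True if at least one neighbor is classified differently."""
    own = classify(read)
    return any(classify(n) != own for n in neighbors(read))


def neighbor_similarity(read, classify):
    """Assumed NS: share of neighbors with the same decision as the read."""
    return neighbor_distribution(read, classify).get(classify(read), 0.0)


def sampled_neighbor_similarity(read, classify, n_samples, rng):
    """Estimate the assumed NS from neighbors sampled with replacement."""
    own = classify(read)
    nbrs = list(neighbors(read))
    hits = sum(classify(rng.choice(nbrs)) == own for _ in range(n_samples))
    return hits / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    for read in ("AAAAAAAA", "AACC"):
        print(
            read,
            toy_classifier(read),
            is_boundary(read, toy_classifier),
            neighbor_similarity(read, toy_classifier),
            sampled_neighbor_similarity(read, toy_classifier, 50, rng),
        )
```

Under these assumptions, a read lies off the boundary exactly when its Neighbor Similarity equals 1, so the two notions are linked by construction. Because the sketch needs only classifier decisions, it applies equally to classifiers that expose no posterior probabilities.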

Paper Structure (22 sections, 10 equations, 11 figures, 11 tables, 1 algorithm)

Introduction
Experimental Setting
Classifiers
Mathematical Preliminaries
The Three Genomes and the Reads Dataset
The Bayes Classifier
Key Concepts
The Boundary
Surrogate Uncertainty Measures
Results
DNA Reads
Relationship of Neighbor Similarity to Inherent Measures of Uncertainty
Exploring the Boundary
Do Boundary Points Differ from Other Reads?
Is the Boundary Connected?
...and 7 more sections

Figures (11)

Figure 1: Neighbor distributions for the 5869 reads, by source. The geometry is explained in the text.
Figure 2: ECDFs of Neighbor Similarity. Left: by read source. Center: by classifier decision. Right: by decision correctness. In each panel, the $x$-axis is Neighbor Similarity and the $y$-axis is cumulative probability.
Figure 3: Relative root mean squared error (RRMSE) in estimating NS from samples of neighbors. The $x$-axis is the number of samples and the $y$-axis is RRMSE.
Figure 4: Scatterplots of MP versus NS, by classifier decision. Left: decision = Adeno. Center: decision = COVID. Right: decision = SARS. In each panel, the $x$-axis is Neighbor Similarity and the $y$-axis is MP.
Figure 5: For the quadratic regression model, scatterplots of predicted MP versus actual MP, by classifier decision. Left: decision = Adeno. Center: decision = COVID. Right: decision = SARS.
...and 6 more figures
