Structure of Classifier Boundaries: Case Study for a Naive Bayes Classifier
Alan F. Karr, Zac Bowen, Adam A. Porter, Regina Ruane
TL;DR
This work investigates the boundary structure of a naive Bayes classifier whose inputs are DNA reads, which form a discrete, graph-structured input space, with each read assigned to one of three genomes. It introduces Neighbor Similarity (NS) as a practical, implementable surrogate for uncertainty and shows that the boundary can be large (up to ~30% of inputs) and geometrically complex. The authors connect NS to the classifier's intrinsic uncertainty measures (maximum posterior probability and entropy), demonstrate the relationships via quadratic and partition models, and provide strategies to locate and explore the boundary, including sampling-based approximations. The findings generalize across datasets and classifiers, offering a principled framework for uncertainty assessment in discrete, graph-based classification with implications for metagenomics and data quality.
Abstract
Classifiers assign complex input data points to one of a small number of output categories. For a Bayes classifier whose input space is a graph, we study the structure of the boundary, which comprises those points for which at least one neighbor is classified differently. The scientific setting is the assignment of DNA reads produced by next-generation sequencers to candidate source genomes. The boundary is both large and complicated in structure. We introduce a new measure of uncertainty, Neighbor Similarity, that compares the result for an input point to the distribution of results for its neighbors. This measure not only tracks two inherent uncertainty measures for the Bayes classifier, but can also be implemented for classifiers without inherent measures of uncertainty.
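
To make the boundary definition and the idea behind Neighbor Similarity concrete, the following is a minimal Python sketch, not the authors' implementation. It rests on several assumptions that the abstract does not state: two reads are treated as neighbors when they differ by a single base substitution, Neighbor Similarity is taken to be the fraction of neighbors that receive the same label as the read itself, and `gc_classifier` is a hypothetical three-class stand-in for the Bayes classifier.

```python
ALPHABET = "ACGT"


def hamming_neighbors(read):
    """Yield every read that differs from `read` by one substitution.

    Assumption: this one-substitution graph is only an illustration;
    the abstract does not specify the neighbor relation.
    """
    for i, base in enumerate(read):
        for alt in ALPHABET:
            if alt != base:
                yield read[:i] + alt + read[i + 1:]


def gc_classifier(read):
    """Hypothetical three-class classifier based on GC content.

    It stands in for the Bayes classifier; any function mapping a
    read to a label could be used instead.
    """
    gc = sum(base in "GC" for base in read)
    if 3 * gc < len(read):
        return 0
    if 3 * gc < 2 * len(read):
        return 1
    return 2


def neighbor_similarity(read, classify):
    """Fraction of neighbors classified the same way as `read`.

    Assumption: one plausible reading of "compares the result for an
    input point to the distribution of results for its neighbors".
    """
    label = classify(read)
    neighbors = list(hamming_neighbors(read))
    same = sum(classify(n) == label for n in neighbors)
    return same / len(neighbors)


def on_boundary(read, classify):
    """A read is on the boundary if any neighbor has a different label."""
    label = classify(read)
    return any(classify(n) != label for n in hamming_neighbors(read))


if __name__ == "__main__":
    for read in ("AAAAAA", "ACAAAA"):
        print(read, on_boundary(read, gc_classifier),
              neighbor_similarity(read, gc_classifier))
```

Because `neighbor_similarity` needs only the predicted labels, the same function applies to any classifier, including those that expose no posterior probabilities. Under this definition, a read lies on the boundary exactly when its similarity is below 1.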
