Structure of Classifier Boundaries: Case Study for a Naive Bayes Classifier

Alan F. Karr, Zac Bowen, Adam A. Porter, Regina Ruane

TL;DR

This work investigates the boundary structure of a naive Bayes classifier whose inputs are discrete, graph-structured DNA reads, each assigned to one of three genomes. It introduces Neighbor Similarity (NS) as a practical surrogate for uncertainty and shows that the boundary can be large (up to ~30% of inputs) and geometrically complex. The authors connect NS to two intrinsic uncertainty measures (maximum posterior probability and entropy), model those relationships with quadratic regression and partition models, and provide strategies to locate and explore the boundary, including sampling-based approximations. The findings generalize across datasets and classifiers, offering a principled framework for uncertainty assessment in discrete, graph-based classification, with implications for metagenomics and data quality.

Abstract

Classifiers assign complex input data points to one of a small number of output categories. For a Bayes classifier whose input space is a graph, we study the structure of the boundary, which comprises those points for which at least one neighbor is classified differently. The scientific setting is assignment of DNA reads produced by next-generation sequencers (NGSs) to candidate source genomes. The boundary is both large and complicated in structure. We introduce a new measure of uncertainty, Neighbor Similarity, that compares the result for an input point to the distribution of results for its neighbors. This measure not only tracks two inherent uncertainty measures for the Bayes classifier, but can also be implemented for classifiers without inherent measures of uncertainty.
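As an illustration of the two definitions in the abstract, the following is a minimal sketch and not the authors' implementation. It makes several assumptions that the text above does not confirm: neighbors are taken to be reads differing by a single base substitution, Neighbor Similarity is taken to be the fraction of neighbors that receive the same decision as the read itself, and `toy_classifier` is a hypothetical stand-in (a GC-content threshold) for the Bayes classifier. The sampled estimator mirrors the idea of approximating NS from a random subset of neighbors.

```python
import random

BASES = "ACGT"


def neighbors(read):
    """Yield every read that differs from `read` by one substituted base.

    Assumption: the graph's edges are single-base substitutions.
    """
    for i, base in enumerate(read):
        for alt in BASES:
            if alt != base:
                yield read[:i] + alt + read[i + 1:]


def toy_classifier(read):
    """Hypothetical stand-in classifier: threshold on GC content."""
    gc = sum(base in "GC" for base in read)
    return "high_GC" if 2 * gc > len(read) else "low_GC"


def neighbor_similarity(read, classify):
    """Fraction of neighbors classified the same as `read` (assumed NS form)."""
    decision = classify(read)
    labels = [classify(n) for n in neighbors(read)]
    return sum(label == decision for label in labels) / len(labels)


def on_boundary(read, classify):
    """True if at least one neighbor is classified differently from `read`."""
    decision = classify(read)
    return any(classify(n) != decision for n in neighbors(read))


def sampled_neighbor_similarity(read, classify, n_samples, rng):
    """Estimate NS from `n_samples` neighbors drawn uniformly with replacement."""
    decision = classify(read)
    hits = 0
    for _ in range(n_samples):
        i = rng.randrange(len(read))
        alt = rng.choice([b for b in BASES if b != read[i]])
        hits += classify(read[:i] + alt + read[i + 1:]) == decision
    return hits / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    for read in ("GGCAAT", "AAAAAA"):
        print(read,
              on_boundary(read, toy_classifier),
              neighbor_similarity(read, toy_classifier),
              sampled_neighbor_similarity(read, toy_classifier, 200, rng))
```

Under these assumptions, the read `GGCAAT` has 18 neighbors, 12 of which share its decision, so its NS is 12/18 and it lies on the boundary. The read `AAAAAA` has NS equal to 1 and lies off the boundary. With this form of NS, a read is on the boundary exactly when its NS is below 1.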

Paper Structure

This paper contains 22 sections, 10 equations, 11 figures, 11 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Neighbor distributions for the 5869 reads, by source. The geometry is explained in the text.
  • Figure 2: ECDFs of Neighbor Similarity. Left: by read source. Center: by classifier decision. Right: by decision correctness. In each panel, the $x$-axis is Neighbor Similarity and the $y$-axis is cumulative probability.
  • Figure 3: Relative root mean squared error (RRMSE) in estimating $\hbox{NS}$ from samples of neighbors. The $x$-axis is the number of samples and the $y$-axis is RRMSE.
  • Figure 4: Scatterplots of $\hbox{MP}$ versus $\hbox{NS}$, by classifier decision. Left: decision = Adeno. Center: decision = COVID. Right: decision = SARS. In each panel, the $x$-axis is $\hbox{NS}$ and the $y$-axis is $\hbox{MP}$.
  • Figure 5: For the quadratic regression model, scatterplots of predicted $\hbox{MP}$ versus actual $\hbox{MP}$, by classifier decision. Left: decision = Adeno. Center: decision = COVID. Right: decision = SARS.
  • ...and 6 more figures